
Resolving vision and language ambiguities together

Published: 01 October 2017

Highlights

  • We perform semantic segmentation and prepositional phrase attachment resolution.
  • Multiple hypotheses are shown to be crucial to improved multiple-module reasoning.
  • Joint reasoning produces more accurate results than any module operating in isolation.

Abstract

We present an approach that simultaneously performs semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without reasoning about an associated image. Consider the sentence "I shot an elephant in my pajamas": from the language alone (and without common sense), it is unclear whether the person, the elephant, or both are wearing the pajamas. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution, which are then jointly re-ranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are also shown to be crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) and 12.83% (25.28% relative) in two different experiments. We also make small improvements over DeepLab-CRF (Chen et al., 2015).
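To make the joint re-ranking idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation: each module proposes a small, diverse set of hypotheses, and every (segmentation, parse) pair is scored by a weighted sum of the modules' own confidences plus a cross-modal consistency term. The data structures, scoring functions, weights, and toy scores below are hypothetical placeholders chosen only to illustrate the selection step.

```python
# Hypothetical sketch of joint re-ranking over diverse hypothesis pairs.
from itertools import product

def segmentation_score(seg):
    """Placeholder: module-internal confidence for a segmentation hypothesis."""
    return seg["score"]

def parse_score(parse):
    """Placeholder: module-internal confidence for a PP-attachment hypothesis."""
    return parse["score"]

def consistency(seg, parse):
    """Placeholder cross-modal agreement: does the proposed attachment target
    correspond to a region actually present in the segmentation?"""
    return 1.0 if parse["attach_to"] in seg["regions"] else 0.0

def joint_rerank(seg_hyps, parse_hyps, w_seg=1.0, w_parse=1.0, w_joint=2.0):
    """Score every (segmentation, parse) pair and return the best-scoring pair."""
    def pair_score(pair):
        seg, parse = pair
        return (w_seg * segmentation_score(seg)
                + w_parse * parse_score(parse)
                + w_joint * consistency(seg, parse))
    return max(product(seg_hyps, parse_hyps), key=pair_score)

# Toy example for "I shot an elephant in my pajamas".
seg_hyps = [
    {"regions": {"person", "pajamas"}, "score": 0.9},
    {"regions": {"person", "elephant", "pajamas"}, "score": 0.7},
]
parse_hyps = [
    {"attach_to": "elephant", "score": 0.6},  # pajamas attach to the elephant
    {"attach_to": "person", "score": 0.5},    # pajamas attach to the speaker
]

best_seg, best_parse = joint_rerank(seg_hyps, parse_hyps)
print(best_parse["attach_to"])  # -> "person" under these toy scores
```

Under these made-up scores, the pair combining the person-plus-pajamas segmentation with the person attachment wins because the consistency term outweighs the small per-module differences, which is the point of re-ranking pairs rather than trusting either module alone.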

References

[1]
F. Ahmed, D. Tarlow, D. Batra, Optimizing expected intersection-over-union with candidate-constrained CRFs, 2015.
[2]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, VQA: visual question answering, ICCV (2015).
[3]
K. Bach, Routledge Encyclopedia of Philosophy entry, 2016. http://online.sfsu.edu/kbach/ambguity.html.
[4]
K. Barnard, M. Johnson, D. Forsyth, Word sense disambiguation with pictures, Association for Computational Linguistics, 2003.
[5]
D. Batra, An efficient message-passing algorithm for the M-Best MAP problem, 2012.
[6]
D. Batra, P. Yadollahpour, A. Guzman-Rivera, G. Shakhnarovich, Diverse M-Best solutions in Markov random fields, 2012.
[7]
Y. Berzak, A. Barbu, D. Harari, B. Katz, S. Ullman, Do you see what I mean? Visual resolution of linguistic ambiguities, arXiv preprint arXiv:1603.08079, 2016.
[8]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, 2015.
[9]
E. Davis, Notes on ambiguity, 2016. http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
[10]
M.-C. De Marneffe, B. MacCartney, C.D. Manning, Generating typed dependency parses from phrase structure parses, 2006.
[11]
M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, IJCV, 88 (2010) 303-338.
[12]
H. Fang, S. Gupta, F.N. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J.C. Platt, C.L. Zitnick, G. Zweig, From captions to visual concepts and back, 2015.
[13]
S. Fidler, A. Sharma, R. Urtasun, A sentence is worth a thousand pixels, 2013.
[14]
S. Gella, M. Lapata, F. Keller, Unsupervised visual sense disambiguation for verbs using multimodal embeddings, arXiv preprint arXiv:1603.09188, 2016.
[15]
D. Geman, S. Geman, N. Hallonquist, L. Younes, A visual turing test for computer vision systems, 2014.
[16]
K. Gimpel, D. Batra, C. Dyer, G. Shakhnarovich, A systematic exploration of diversity in machine translation, 2013.
[17]
A. Guzman-Rivera, P. Kohli, D. Batra, DivMCuts: faster training of structural SVMs with diverse M-Best cutting-planes, 2013.
[18]
G. Heitz, S. Gould, A. Saxena, D. Koller, Cascaded classification models: combining models for holistic scene understanding, 2008.
[19]
L. Huang, D. Chiang, Better k-best parsing, 2005.
[20]
J.H. Kappes, B. Andres, F.A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B.X. Kausler, J. Lellmann, N. Komodakis, C. Rother, A comparative study of modern inference techniques for discrete energy minimization problems, 2013.
[21]
C. Kong, D. Lin, M. Bansal, R. Urtasun, S. Fidler, What are you talking about? text-to-image coreference, 2014.
[22]
T. Lan, W. Yang, Y. Wang, G. Mori, Image retrieval with structured object queries using latent ranking SVM, 2012.
[23]
M. Malinowski, M. Fritz, A pooling approach to modelling spatial relations for image retrieval and annotation, arXiv preprint arXiv:1411.5190, 2014.
[24]
M. Malinowski, M. Rohrbach, M. Fritz, Ask your neurons: a neural-based approach to answering questions about images, 2015.
[25]
T. Meltzer, C. Yanover, Y. Weiss, Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation, 2005.
[26]
T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013.
[27]
R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, A. Yuille, The role of context for object detection and semantic segmentation in the wild, 2014.
[28]
M. Poesio, R. Artstein, Annotating (anaphoric) ambiguity, 2005.
[29]
A. Prasad, S. Jegelka, D. Batra, Submodular meets structured: finding diverse subsets in exponentially-large structured item sets, 2014.
[30]
V. Premachandran, D. Tarlow, D. Batra, Empirical minimum Bayes risk prediction: how to extract an extra few% performance from vision models with just three more parameters, 2014.
[31]
C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, 2010.
[32]
A. Ratnaparkhi, J. Reynar, S. Roukos, A maximum entropy model for prepositional phrase attachment, 1994.
[33]
Q. Sun, A. Laddha, D. Batra, Active learning for structured probabilistic models with histogram approximation, 2015.
[34]
R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, C. Rother, A comparative study of energy minimization methods for Markov random fields with smoothness-based priors, PAMI, 30 (2008) 1068-1080.
[35]
R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, 2014.
[36]
O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator, 2015.
[37]
P. Yadollahpour, D. Batra, G. Shakhnarovich, Discriminative re-ranking of diverse segmentations, 2013.
[38]
M. Yatskar, M. Galley, L. Vanderwende, L. Zettlemoyer, See no evil, say no evil: description generation from densely labeled images, 2014.
[39]
L. Yu, E. Park, A.C. Berg, T.L. Berg, Visual Madlibs: fill in the blank description generation and question answering, 2015.
[40]
C.L. Zitnick, D. Parikh, Bringing semantics into focus using visual abstraction, 2013.

Cited By

  • (2024) Extraction and Analysis of Semantic Features of English Texts under Intelligent Algorithms, Automatic Control and Computer Sciences, 58:1, 109-115. DOI: 10.3103/S0146411624010115. Online publication date: 1-Feb-2024.
  • (2021) A novel automatic image caption generation using bidirectional long-short term memory framework, Multimedia Tools and Applications, 80:17, 25557-25582. DOI: 10.1007/s11042-021-10632-6. Online publication date: 1-Jul-2021.
  • (2020) Grounded language interpretation of robotic commands through structured learning, Artificial Intelligence, 278:C. DOI: 10.1016/j.artint.2019.103181. Online publication date: 1-Jan-2020.


    Published In

    Computer Vision and Image Understanding, Volume 163, Issue C
    October 2017
    89 pages

    Publisher

    Elsevier Science Inc.

    United States

    Publication History

    Published: 01 October 2017

    Author Tags

    1. Prepositional phrase ambiguity resolution
    2. Semantic segmentation

    Qualifiers

    • Research-article
