Combining linguistic and pictorial information: Using captions to interpret newspaper photographs

Rohini K. Srihari¹ &
William J. Rapaport¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 437)

Included in the following conference series:

Workshop on Semantic Network Processing Systems

Abstract

There are many situations where linguistic and pictorial data are jointly presented to communicate information. A computer model for synthesising information from the two sources requires an initial interpretation of both the text and the picture, followed by consolidation of information. The problem of performing general-purpose vision (without a priori knowledge) would make this a nearly impossible task. However, in some situations, the text describes salient aspects of the picture. In such situations, it is possible to extract visual information from the text, resulting in a relational graph describing the structure of the accompanying picture. This graph can then be used by a computer vision system to guide the interpretation of the picture. This paper discusses an application whereby information obtained from parsing a caption of a newspaper photograph is used to identify human faces in the photograph. Heuristics are described for extracting information from the caption that contributes to the hypothesised structure of the picture. The top-down processing of the image using this information is discussed.
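
As a rough illustration of how a caption-derived relational graph might constrain face identification, the following minimal sketch treats the task as a small consistent-labeling search. It is not the system described in the paper: the function name `label_faces`, the single "left of" relation, and the representation of face candidates by their horizontal positions are all assumptions made for this example.

```python
from itertools import permutations


def label_faces(names, left_of, face_xs):
    """Return every assignment of names to face candidates that is
    consistent with the caption-derived constraints.

    names   -- people mentioned in the caption (hypothetical input)
    left_of -- pairs (a, b) meaning "a appears to the left of b"
    face_xs -- x-coordinates of candidate faces found in the image
    """
    solutions = []
    # Try each way of assigning a distinct face candidate to each name.
    for perm in permutations(range(len(face_xs)), len(names)):
        assignment = dict(zip(names, perm))
        # Keep the assignment only if every spatial constraint holds.
        if all(face_xs[assignment[a]] < face_xs[assignment[b]]
               for a, b in left_of):
            solutions.append(assignment)
    return solutions


# Hypothetical caption: "From left: Ann, Bob and Cy."
names = ["Ann", "Bob", "Cy"]
left_of = [("Ann", "Bob"), ("Bob", "Cy")]
# Hypothetical face candidates, listed in arbitrary detection order.
face_xs = [120.0, 40.0, 200.0]

print(label_faces(names, left_of, face_xs))
```

The exhaustive search is adequate only for the handful of people typically named in a caption; with more candidates or richer relations, a constraint-propagation method would be a more natural choice.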

Author information

Authors and Affiliations

Department of Computer Science, State University of New York at Buffalo, 14260, Buffalo, New York, USA
Rohini K. Srihari & William J. Rapaport

Editor information

D. Kumar

About this paper

Cite this paper

Srihari, R.K., Rapaport, W.J. (1990). Combining linguistic and pictorial information: Using captions to interpret newspaper photographs. In: Kumar, D. (eds) Current Trends in SNePS — Semantic Network Processing System. SNePS 1989. Lecture Notes in Computer Science, vol 437. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022085

DOI: https://doi.org/10.1007/BFb0022085
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-52626-1
Online ISBN: 978-3-540-47081-6
eBook Packages: Springer Book Archive
