Abstract
There are many situations where linguistic and pictorial data are presented jointly to communicate information. A computer model for synthesising information from the two sources requires an initial interpretation of both the text and the picture, followed by consolidation of the information. The difficulty of general-purpose vision (vision without a priori knowledge) would make this a nearly impossible task. However, in some situations the text describes salient aspects of the picture. In such situations it is possible to extract visual information from the text, resulting in a relational graph describing the structure of the accompanying picture. This graph can then be used by a computer vision system to guide the interpretation of the picture. This paper discusses an application in which information obtained from parsing the caption of a newspaper photograph is used to identify human faces in the photograph. Heuristics are described for extracting from the caption the information that contributes to the hypothesised structure of the picture. The top-down processing of the image using this information is also discussed.
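The idea of using a caption-derived relational graph to guide face identification can be illustrated with a small sketch. It is not the system described in the paper: the names, the single `left_of` relation, the face coordinates, and the function `label_faces` are all hypothetical, and the brute-force search stands in for whatever constraint-satisfaction method the authors use. It assumes that a caption parser has already produced spatial relations between named people and that a face locator has already produced candidate face positions.

```python
from itertools import permutations

# Hypothetical caption-derived structure: people named in the caption and
# spatial relations between them (hand-written here, not parsed).
people = ["Alice", "Bob", "Carol"]
constraints = [("Alice", "left_of", "Bob"), ("Bob", "left_of", "Carol")]

# Hypothetical face-locator output: candidate face centres (x, y) in pixels.
faces = {"f1": (310, 90), "f2": (120, 85), "f3": (520, 100)}


def satisfies(assignment, constraint, faces):
    """Check one spatial constraint against a name-to-face assignment."""
    a, relation, b = constraint
    xa = faces[assignment[a]][0]
    xb = faces[assignment[b]][0]
    if relation == "left_of":
        return xa < xb
    raise ValueError(f"unknown relation: {relation}")


def label_faces(people, constraints, faces):
    """Return every name-to-face assignment consistent with all constraints."""
    solutions = []
    # Try each way of giving every person a distinct candidate face.
    for combo in permutations(faces, len(people)):
        assignment = dict(zip(people, combo))
        if all(satisfies(assignment, c, faces) for c in constraints):
            solutions.append(assignment)
    return solutions


if __name__ == "__main__":
    for solution in label_faces(people, constraints, faces):
        print(solution)
```

With these toy inputs only one assignment satisfies both constraints. Exhaustive enumeration is practical only for the handful of faces in a typical photograph; a larger problem would call for a proper constraint-propagation method.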
Copyright information
© 1990 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Srihari, R.K., Rapaport, W.J. (1990). Combining linguistic and pictorial information: Using captions to interpret newspaper photographs. In: Kumar, D. (ed.) Current Trends in SNePS — Semantic Network Processing System. SNePS 1989. Lecture Notes in Computer Science, vol 437. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022085
DOI: https://doi.org/10.1007/BFb0022085
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-52626-1
Online ISBN: 978-3-540-47081-6
eBook Packages: Springer Book Archive