Text Area Identification in Web Images

Stavros J. Perantonis, Basilios Gatos, Vassilios Maragos, Vangelis Karkaletsis & George Petasis

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3025)

Included in the following conference series:

Hellenic Conference on Artificial Intelligence

Abstract

With the explosive growth of the World Wide Web, millions of documents are published and accessed online. Statistics show that a significant part of Web text information is encoded in Web images. Because Web images have special characteristics that sometimes distinguish them from other types of images, commercial OCR products often fail to recognize the text they contain. This paper proposes a novel Web image processing algorithm that aims to locate text areas and prepare them for the OCR procedure so that recognition results improve. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
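
The abstract does not describe how text areas are located, so the following is only a generic illustration of the idea and not the authors' algorithm. It is a minimal sketch assuming a grayscale image given as a list of rows, dark text on a light background, a fixed threshold, and a minimum region size; the function name `find_text_candidates` and its parameters are hypothetical. It groups dark pixels into 4-connected regions and returns their bounding boxes, which a later stage could crop and pass to an OCR engine.

```python
from collections import deque


def find_text_candidates(gray, threshold=128, min_area=2):
    """Return bounding boxes (top, left, bottom, right) of dark connected regions.

    gray      -- list of rows of integer intensities (0 = black, 255 = white)
    threshold -- pixels strictly below this value count as foreground (assumption)
    min_area  -- regions with fewer pixels are discarded as noise (assumption)
    """
    rows, cols = len(gray), len(gray[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or gray[r][c] >= threshold:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and gray[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


# Toy 4x5 image: one three-pixel dark region and one isolated dark pixel.
image = [
    [255, 255, 255, 255, 255],
    [255,   0,   0, 255, 255],
    [255,   0, 255, 255,  10],
    [255, 255, 255, 255, 255],
]
print(find_text_candidates(image))
```

Real Web images typically involve colour, anti-aliasing, and text that is lighter than its background, so a fixed global threshold like the one assumed here would rarely be sufficient on its own.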


Author information

Authors and Affiliations

National Research Center “Demokritos”, Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, 153 10, Athens, Greece
Stavros J. Perantonis, Basilios Gatos & Vassilios Maragos
National Research Center “Demokritos”, Software and Knowledge Engineering, Institute of Informatics and Telecommunications, 153 10, Athens, Greece
Vangelis Karkaletsis & George Petasis
Department of Computer Science, Technological Educational Institution of Athens, 122 10, Egaleo, Greece
Vassilios Maragos

Editor information

Editors and Affiliations

Info and Communication Systems Eng, Aegean University, 83200, Karlovassi, Samos, Greece
George A. Vouros
Department of Informatics, University of Piraeus, Piraeus, Greece
Themistoklis Panayiotopoulos

About this paper

Cite this paper

Perantonis, S.J., Gatos, B., Maragos, V., Karkaletsis, V., Petasis, G. (2004). Text Area Identification in Web Images. In: Vouros, G.A., Panayiotopoulos, T. (eds) Methods and Applications of Artificial Intelligence. SETN 2004. Lecture Notes in Computer Science, vol 3025. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24674-9_10

DOI: https://doi.org/10.1007/978-3-540-24674-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21937-8
Online ISBN: 978-3-540-24674-9
eBook Packages: Springer Book Archive