Abstract
With the explosive growth of the World Wide Web, millions of documents are published and accessed online. Statistics show that a significant part of the textual information on the Web is encoded in Web images. Because Web images have special characteristics that can distinguish them from other types of images, commercial OCR products often fail to recognize the text they contain. This paper proposes a novel Web image processing algorithm that locates text areas and prepares them for OCR, with the aim of improving recognition results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Perantonis, S.J., Gatos, B., Maragos, V., Karkaletsis, V., Petasis, G. (2004). Text Area Identification in Web Images. In: Vouros, G.A., Panayiotopoulos, T. (eds) Methods and Applications of Artificial Intelligence. SETN 2004. Lecture Notes in Computer Science, vol 3025. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24674-9_10
DOI: https://doi.org/10.1007/978-3-540-24674-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21937-8
Online ISBN: 978-3-540-24674-9
eBook Packages: Springer Book Archive