Layout Analysis and Content Classification in Digitized Books

Andrea Corbelli¹⁴,
Lorenzo Baraldi¹⁴,
Fabrizio Balducci¹⁴,
Costantino Grana¹⁴ &
Rita Cucchiara¹⁴

Part of the book series: Communications in Computer and Information Science (CCIS, volume 701)

Included in the following conference series:

Italian Research Conference on Digital Libraries

Abstract

Automatic layout analysis has proven extremely important in digitizing large document collections. In this paper we present a mixed approach to layout analysis, introducing an SVM-aided layout segmentation process and a classification process based on local and geometrical features. The final output of the automatic analysis algorithm is a complete, structured annotation in JSON format that contains the digitized text and references to all the illustrations on the input page, and that can be used by both visualization and annotation interfaces. We evaluate our algorithm on a large dataset built upon the first volume of the “Enciclopedia Treccani”.
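The abstract does not specify the JSON schema, so the following is only a minimal sketch of what a per-page annotation holding recognized text and illustration references might look like. Every name in it (`Region`, `PageAnnotation`, `kind`, `bbox`, `text`, `image_ref`, the file name) is a hypothetical choice made for illustration, not the authors' format.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Region:
    """One segmented block of a page (all field names are hypothetical)."""
    kind: str            # assumed labels: "text" or "illustration"
    bbox: List[int]      # assumed layout: [x, y, width, height] in pixels
    text: str = ""       # recognized text, used for text regions only
    image_ref: str = ""  # reference to a cropped illustration, else empty


@dataclass
class PageAnnotation:
    """Structured annotation for a single input page."""
    page: int
    regions: List[Region] = field(default_factory=list)


def to_json(annotation: PageAnnotation) -> str:
    # asdict() converts nested dataclasses (including those in lists) to dicts.
    return json.dumps(asdict(annotation), ensure_ascii=False, indent=2)


page = PageAnnotation(
    page=1,
    regions=[
        Region(kind="text", bbox=[40, 60, 500, 120], text="Example OCR text"),
        Region(kind="illustration", bbox=[40, 200, 300, 240],
               image_ref="page_0001_fig_01.png"),
    ],
)
serialized = to_json(page)
print(serialized)
```

Keeping text and illustration regions in one list, each with its own bounding box, is one way a viewer or an annotation tool could read the same file; the actual structure used in the paper may differ.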

Author information

Authors and Affiliations

Dipartimento di Ingegneria “Enzo Ferrari”, Università degli Studi di Modena e Reggio Emilia, Via Vivarelli 10, 41125, Modena, Modena, Italy
Andrea Corbelli, Lorenzo Baraldi, Fabrizio Balducci, Costantino Grana & Rita Cucchiara

Corresponding author

Correspondence to Lorenzo Baraldi.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Padova, Padua, Italy
Maristella Agosti
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze, Florence, Italy
Marco Bertini
Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy
Stefano Ferilli
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze, Florence, Italy
Simone Marinai
Dipartimento dei Beni Culturali, Università degli Studi di Padova, Padua, Italy
Nicola Orio

About this paper

Cite this paper

Corbelli, A., Baraldi, L., Balducci, F., Grana, C., Cucchiara, R. (2017). Layout Analysis and Content Classification in Digitized Books. In: Agosti, M., Bertini, M., Ferilli, S., Marinai, S., Orio, N. (eds) Digital Libraries and Multimedia Archives. IRCDL 2016. Communications in Computer and Information Science, vol 701. Springer, Cham. https://doi.org/10.1007/978-3-319-56300-8_14

DOI: https://doi.org/10.1007/978-3-319-56300-8_14
Published: 08 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56299-5
Online ISBN: 978-3-319-56300-8
eBook Packages: Computer Science, Computer Science (R0)
