short-paper

Glyph Miner: A System for Efficiently Extracting Glyphs from Early Prints in the Context of OCR

Authors:

Benedikt Budig,

Thomas C. van Dijk,

Felix Kirchner

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries

Pages 31 - 34

https://doi.org/10.1145/2910896.2910915

Published: 19 June 2016


Abstract

While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints provides a significant challenge. To achieve good recognition quality, existing software must be "trained" specifically to each particular corpus. This is a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: Given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.


Cited By

Kordon F, Weichselbaumer N, Herz R, van der Loop J, Mossman S, Potten E, Seuret M, Mayr M, Wu F, Christlein V (2024). fang: Fast Annotation of Glyphs in Historical Printed Documents. Document Analysis Systems, pages 377-392. Online publication date: 11-Sep-2024. https://doi.org/10.1007/978-3-031-70442-0_23

Kordon F, Weichselbaumer N, Herz R, Mossman S, Potten E, Seuret M, Mayr M, Christlein V (2023). Classification of incunable glyphs and out-of-distribution detection with joint energy-based models. International Journal on Document Analysis and Recognition (IJDAR), 26:3, pages 223-240. Online publication date: 22-Jun-2023. https://doi.org/10.1007/s10032-023-00442-x

Yin Y, Wang Y, Jiang Y, Fan S, Xiong J, Gui G (2019). The Image Preprocessing and Check of Amount for VAT Invoices. Communications, Signal Processing, and Systems, pages 44-51. Online publication date: 14-Aug-2019. https://doi.org/10.1007/978-981-13-6504-1_6

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
    2. Retrieval tasks and goals
      1. Information extraction



Published In

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries

June 2016

316 pages

ISBN:9781450342292

DOI:10.1145/2910896

General Chairs: Nabil R. Adam (Rutgers University), Boots Cassel (Villanova University), Yelena Yesha (University of Maryland, Baltimore County)

Program Chairs: Richard Furuta (Texas A&M University), Michele C. Weigle (Old Dominion University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Bundesministerium für Bildung und Forschung

Conference

JCDL '16: The 16th ACM/IEEE-CS Joint Conference on Digital Libraries

June 19 - 23, 2016

Newark, New Jersey, USA

Acceptance Rates

JCDL '16 Paper Acceptance Rate 15 of 52 submissions, 29%;

Overall Acceptance Rate 415 of 1,482 submissions, 28%


