DOI: 10.1145/2910896.2910915
short-paper

Glyph Miner: A System for Efficiently Extracting Glyphs from Early Prints in the Context of OCR

Published: 19 June 2016

Abstract

While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints poses a significant challenge. To achieve good recognition quality, existing software must be "trained" specifically for each corpus. This is a tedious process that requires significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.

Cited By

  • (2024) fang: Fast Annotation of Glyphs in Historical Printed Documents. Document Analysis Systems, pages 377-392. DOI: 10.1007/978-3-031-70442-0_23. Online publication date: 11-Sep-2024
  • (2023) Classification of incunable glyphs and out-of-distribution detection with joint energy-based models. International Journal on Document Analysis and Recognition (IJDAR) 26:3, pages 223-240. DOI: 10.1007/s10032-023-00442-x. Online publication date: 22-Jun-2023
  • (2019) The Image Preprocessing and Check of Amount for VAT Invoices. Communications, Signal Processing, and Systems, pages 44-51. DOI: 10.1007/978-981-13-6504-1_6. Online publication date: 14-Aug-2019

Published In

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
June 2016
316 pages
ISBN:9781450342292
DOI:10.1145/2910896

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document recognition
  2. early prints
  3. efficient user interaction
  4. glyph extraction
  5. ocr

Qualifiers

  • Short-paper

Conference

JCDL '16

Acceptance Rates

JCDL '16 Paper Acceptance Rate: 15 of 52 submissions, 29%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%
