Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12917)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.
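
As a rough illustration of the decomposed setup described above, the sketch below chains three stand-in steps: recognition, word segmentation and normalisation. It is a toy under stated assumptions, not the authors' implementation: the function names, the lookup table `EXPANSIONS` and the word list `VOCABULARY` are all hypothetical, the "image" is already a string, and trained specialist models would replace each step in practice.

```python
"""Toy sketch of a decomposed pipeline: recognition -> segmentation -> normalisation."""
from typing import List

# Hypothetical lookup table standing in for a trained normalisation model.
EXPANSIONS = {"dns": "dominus", "scs": "sanctus", "ds": "deus"}

# Hypothetical word list standing in for a trained word-segmentation model.
VOCABULARY = {"dns", "scs", "ds", "est", "et"}


def recognise(line_image: str) -> str:
    """Stand-in for an HTR engine trained on abbreviated text.

    The "image" here is already a string, so it is returned unchanged.
    """
    return line_image


def segment(text: str) -> List[str]:
    """Split unspaced text by greedy longest match against VOCABULARY."""
    longest = max(len(word) for word in VOCABULARY)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in VOCABULARY:
                tokens.append(candidate)
                i += size
                break
        else:
            # No known word starts here: keep the character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def normalise(tokens: List[str]) -> str:
    """Expand known abbreviations and leave every other token untouched."""
    return " ".join(EXPANSIONS.get(token, token) for token in tokens)


def decomposed_pipeline(line_image: str) -> str:
    """Run the three discrete steps in sequence."""
    return normalise(segment(recognise(line_image)))


if __name__ == "__main__":
    print(decomposed_pipeline("dnsscsest"))
```

In the direct setup, by contrast, a single HTR model trained on expanded text would map the line image straight to the normalised string, with no intermediate abbreviated transcription to inspect or correct.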

Datasets Availability. Datasets produced for this paper and evaluation scripts are available at DOI 10.5281/zenodo.5071963.

Acknowledgements

We thank the École nationale des chartes and the DIM STCN for the computing power and GPU server used for training, as well as INRIA and Calfa. We also thank Marc H. Smith for his keen review of our draft. Any remaining mistakes are only attributable to us.

Author information

Authors and Affiliations

École Nationale des Chartes – Université Paris, Sciences and Lettres, 65 rue de Richelieu, 75002, Paris, France
Jean-Baptiste Camps, Chahan Vidal-Gorène & Marguerite Vernet

Corresponding author

Correspondence to Jean-Baptiste Camps.

Editor information

Editors and Affiliations

Boise State University, Boise, ID, USA
Elisa H. Barney Smith
Indian Statistical Institute, Kolkata, India
Umapada Pal

About this paper

Cite this paper

Camps, JB., Vidal-Gorène, C., Vernet, M. (2021). Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_21

DOI: https://doi.org/10.1007/978-3-030-86159-9_21
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86158-2
Online ISBN: 978-3-030-86159-9
eBook Packages: Computer Science, Computer Science (R0)
