Abstract
Although abbreviations are common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research on computational approaches to their expansion is scarce. Yet abbreviations pose particular challenges for computational tasks such as handwritten text recognition (HTR) and natural language processing. Pre-processing often aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text: either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.
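The two setups contrasted above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the stub recognisers and the three-entry abbreviation table are hypothetical stand-ins for trained models, chosen only to show how the decomposed pipeline composes its steps and how its output compares with that of an engine trained directly on expanded text.

```python
from typing import Callable, Dict, List

# Hypothetical toy table of Latin abbreviations (illustration only,
# not taken from the paper's datasets).
ABBREVIATIONS: Dict[str, str] = {
    "dns": "dominus",
    "scs": "sanctus",
    "eps": "episcopus",
}


def segment(raw: str) -> List[str]:
    """Stand-in for a word segmentation model: split on any whitespace."""
    return raw.split()


def normalise(tokens: List[str]) -> List[str]:
    """Stand-in for a normalisation model: expand known abbreviations."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def decomposed_pipeline(line_image: object,
                        recognise: Callable[[object], str]) -> str:
    """Recognition, then word segmentation, then normalisation."""
    raw = recognise(line_image)
    return " ".join(normalise(segment(raw)))


def fake_abbreviated_htr(_image: object) -> str:
    """Stub for an HTR engine trained on abbreviated transcriptions."""
    return "scs  eps dns"  # irregular spacing, abbreviations kept


def fake_normalised_htr(_image: object) -> str:
    """Stub for an HTR engine trained directly on expanded text."""
    return "sanctus episcopus dominus"


decomposed = decomposed_pipeline(None, fake_abbreviated_htr)
direct = fake_normalised_htr(None)
print(decomposed)
print(direct)
```

In practice each stub would be replaced by a trained model, and the two outputs would be compared against a reference transcription rather than against each other.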
Datasets Availability. Datasets produced for this paper and evaluation scripts are available at DOI 10.5281/zenodo.5071963.
Acknowledgements
We thank the École nationale des chartes and the DIM STCN for the computing power and GPU server used for training, as well as INRIA and Calfa. We also thank Marc H. Smith for his keen review of our draft. Any remaining mistakes are only attributable to us.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Camps, JB., Vidal-Gorène, C., Vernet, M. (2021). Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_21
DOI: https://doi.org/10.1007/978-3-030-86159-9_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86158-2
Online ISBN: 978-3-030-86159-9
eBook Packages: Computer Science (R0)