Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 Workshops (ICDAR 2021)

Abstract

Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.

Datasets Availability. Datasets produced for this paper and evaluation scripts are available at DOI 10.5281/zenodo.5071963.

Acknowledgements

We thank the École nationale des chartes and the DIM STCN for the computing power and GPU server used for training, as well as INRIA and Calfa. We also thank Marc H. Smith for his keen review of our draft. Any remaining mistakes are only attributable to us.

Author information

Corresponding author

Correspondence to Jean-Baptiste Camps.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Camps, JB., Vidal-Gorène, C., Vernet, M. (2021). Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham. https://doi.org/10.1007/978-3-030-86159-9_21

  • DOI: https://doi.org/10.1007/978-3-030-86159-9_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86158-2

  • Online ISBN: 978-3-030-86159-9

  • eBook Packages: Computer Science, Computer Science (R0)
