Computer Science > Computation and Language

arXiv:2303.16256 (cs)

[Submitted on 28 Mar 2023]

Title: Scalable handwritten text recognition system for lexicographic sources of under-resourced languages and alphabets

Authors: Jan Idziak, Artjoms Šeļa, Michał Woźniak, Albert Leśniak, Joanna Byszuk, Maciej Eder

Abstract: The paper discusses an approach to deciphering large collections of handwritten index cards of historical dictionaries. Our study provides a working solution that reads the cards and links their lemmas to a searchable list of dictionary entries for a large historical dictionary, the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model; (2) a recognition model to decipher the handwritten content, designed as a spatial transformer network (STN) followed by a convolutional neural network (RCNN) with a connectionist temporal classification (CTC) layer, trained using a synthetic set of 500,000 generated Polish words of different lengths; (3) a post-processing step using constrained Word Beam Search (WBC), in which the predictions are matched against a list of dictionary entries known in advance. Our model achieved an accuracy of 0.881 on the word level, which outperforms the base RCNN model. Within this study we produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and transfer learning HTR applications.
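
The idea behind steps (2) and (3) of the abstract, decoding a CTC output and then constraining the result to dictionary entries known in advance, can be illustrated with a minimal Python sketch. This is not the authors' implementation and it is not Word Beam Search: it assumes a simple best-path CTC collapse followed by a nearest-entry lookup using edit distance. The blank symbol `-`, the function names, and the tiny lexicon are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical CTC blank symbol, chosen for this sketch


def ctc_greedy_collapse(path: str, blank: str = BLANK) -> str:
    """Best-path CTC rule: merge repeated symbols, then drop blanks."""
    return "".join(ch for ch, _ in groupby(path) if ch != blank)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def match_to_lexicon(word: str, lexicon: list[str]) -> str:
    """Return the lexicon entry closest to `word`; ties break alphabetically."""
    return min(lexicon, key=lambda entry: (edit_distance(word, entry), entry))


# Hypothetical per-frame output of a recognizer and a toy list of lemmas.
raw_path = "kk-oo--tt-t"
lexicon = ["kot", "koń", "dom"]

decoded = ctc_greedy_collapse(raw_path)
lemma = match_to_lexicon(decoded, lexicon)
```

Here the raw path collapses to a misspelled word, and the lookup maps it to the closest known lemma. A word beam search would instead apply the lexicon constraint during decoding rather than afterwards, which is the main simplification this sketch makes.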

Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.16256 [cs.CL]
	(or arXiv:2303.16256v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.16256
Journal reference:	Computational Science ICCS 2021, vol. 1. (LNCS 12742). Springer, pp. 137-150

Submission history

From: Maciej Eder
[v1] Tue, 28 Mar 2023 19:06:27 UTC (1,574 KB)
