Abstract
This paper introduces IHR-NomDB, a new handwritten database for ChuNom, an old Vietnamese writing system. Over 260 pages of ChuNom text were collected from the Vietnamese Nom Preservation Foundation and manually annotated with bounding boxes, yielding more than 5000 patches, each containing an image of handwritten text, the corresponding digital ChuNom characters, and their translation into modern Vietnamese script. The handwriting dataset is accompanied by a new Synthetic Nom String dataset of 101,621 images generated from our collected bank of ChuNom sentences. In total, 13,254 characters are represented across the two parts of the database, making it the first and largest publicly available database for research on this old Vietnamese script. As a baseline, we evaluated a Convolutional Recurrent Neural Network (CRNN), pretrained on the Synthetic Nom String dataset with CTC loss, on the validation set of the handwriting dataset; it achieved \(42.70\%\) accuracy at the sentence level and \(82.28\%\) accuracy at the character level. The database is available for download at https://morphoboid.labri.fr/ihr-nom.html.
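The two baseline figures can be read as follows. This is a minimal sketch, not the authors' evaluation code: it assumes sentence-level accuracy is the fraction of exact string matches and character-level accuracy is one minus the total Levenshtein distance divided by the total reference length (the abstract does not state the exact definitions). The function names, the blank index `0`, and the greedy CTC collapse step are illustrative assumptions.

```python
from typing import List, Sequence


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse a per-frame best-label path: merge repeats, then drop blanks.

    Assumption: label index 0 is the CTC blank (hypothetical choice).
    """
    out: List[int] = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting insertions, deletions, substitutions."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1]


def sentence_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of predictions that match the reference exactly (assumed definition)."""
    exact = sum(1 for r, h in zip(refs, hyps) if r == h)
    return exact / len(refs)


def character_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """1 - (total edit distance / total reference characters) (assumed definition)."""
    total_chars = sum(len(r) for r in refs)
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 1.0 - total_errors / total_chars
```

Under these assumed definitions, a single wrong character makes the whole sentence count as an error, which is consistent with the sentence-level figure being much lower than the character-level one.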
Notes
- IHR = Images Handwritten Recognition.
Acknowledgements
The authors would like to thank the Vietnamese Nom Preservation Foundation (http://nomfoundation.org) for granting permission to access and collect the data used to analyze and create the database.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Vu, M.T., Le, V.L., Beurton-Aimar, M. (2021). IHR-NomDB: The Old Degraded Vietnamese Handwritten Script Archive Database. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12823. Springer, Cham. https://doi.org/10.1007/978-3-030-86334-0_6
DOI: https://doi.org/10.1007/978-3-030-86334-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86333-3
Online ISBN: 978-3-030-86334-0
eBook Packages: Computer Science, Computer Science (R0)