IHR-NomDB: The Old Degraded Vietnamese Handwritten Script Archive Database

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12823)

Abstract

This paper introduces IHR-NomDB, a new handwritten database for an old Vietnamese writing system called ChuNom. Over 260 pages of ChuNom were collected from the Vietnamese Nom Preservation Foundation, then analyzed and manually annotated with bounding boxes to generate more than 5000 patches, each containing an image of handwritten text, the corresponding digital ChuNom characters, and their translation into modern Vietnamese script. Alongside this handwriting dataset is a new Synthetic Nom String dataset, which consists of 101,621 images generated from our collected bank of ChuNom sentences. In total, 13,254 characters are represented across the two parts of the database, making it the first and largest publicly available database for research on this old Vietnamese writing script. For baseline results, we tested on the validation set of the handwriting dataset using a Convolutional Recurrent Neural Network (CRNN) pretrained on the Synthetic Nom String dataset with CTC loss, achieving 42.70% accuracy at the sentence level and 82.28% accuracy at the character level. The database is available for download at https://morphoboid.labri.fr/ihr-nom.html.
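The abstract reports accuracy at both the sentence level and the character level but does not define either metric. The following is a minimal sketch of one common way to compute them, not necessarily the authors' evaluation code. It assumes that sentence-level accuracy is the fraction of exact matches and that character-level accuracy is one minus the total Levenshtein edit distance divided by the total number of ground-truth characters. The function names and the Latin placeholder strings are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sentence_accuracy(preds: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of predictions that match the ground truth exactly (assumed definition)."""
    exact = sum(p == t for p, t in zip(preds, truths))
    return exact / len(truths)


def character_accuracy(preds: Sequence[str], truths: Sequence[str]) -> float:
    """1 - (total edit distance / total ground-truth characters), clipped at 0 (assumed definition)."""
    total_chars = sum(len(t) for t in truths)
    total_errors = sum(edit_distance(p, t) for p, t in zip(preds, truths))
    return max(0.0, 1.0 - total_errors / total_chars)


# Placeholder strings stand in for recognized and ground-truth ChuNom sequences.
truths = ["abcd", "efgh"]
preds = ["abcd", "efxh"]
print(sentence_accuracy(preds, truths))   # 0.5
print(character_accuracy(preds, truths))  # 0.875
```

Under these assumptions a single wrong character makes a whole sentence count as incorrect while costing only one character error, which is one reason a sentence-level score can sit well below the character-level score.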

Notes

  1. http://nomfoundation.org.

  2. IHR = Images Handwritten Recognition.

  3. http://nomfoundation.org/nom-project/Tale-of-Kieu.

  4. http://nomfoundation.org/nom-project/Luc-Van-Tien.

  5. http://nomfoundation.org/nom-project/History-of-Greater-Vietnam.

  6. http://nomfoundation.org/nom-project/Chinh-Phu-Ngam-Khuc.

  7. http://nomfoundation.org/nom-project/Ho-Xuan-Huong.

  8. https://chunom.org/shelf/corpus.

  9. http://nomfoundation.org/nom-tools/Tu-Dien-Chu-Nom-Dan_Giai.

Acknowledgements

The authors would like to thank the Vietnamese Nom Preservation Foundation (http://nomfoundation.org) for granting permission to access and collect the data used to analyze and create the database.

Author information

Corresponding authors

Correspondence to Manh Tu Vu, Van Linh Le or Marie Beurton-Aimar.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Vu, M.T., Le, V.L., Beurton-Aimar, M. (2021). IHR-NomDB: The Old Degraded Vietnamese Handwritten Script Archive Database. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12823. Springer, Cham. https://doi.org/10.1007/978-3-030-86334-0_6

  • DOI: https://doi.org/10.1007/978-3-030-86334-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86333-3

  • Online ISBN: 978-3-030-86334-0

  • eBook Packages: Computer Science, Computer Science (R0)
