DOI: 10.1145/3476887.3476892
research-article

Digital Peter: New Dataset, Competition and Handwriting Recognition Methods

Published: 31 October 2021

Abstract

This paper presents a new dataset of Peter the Great's manuscripts and describes a segmentation procedure that converts the initial document images into lines. The dataset may be useful to researchers for training handwritten text recognition models and as a benchmark for comparing different models. It consists of 9694 images and text files corresponding to different lines in historical documents. The open machine learning competition "Digital Peter" was held on this dataset. The article describes the baseline solution for the competition and more advanced methods for handwritten text recognition. The full dataset and all code are publicly available.
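As a rough illustration of the line-level structure described in the abstract (one image per text line, with a matching text file), the sketch below pairs each line image with its transcription by file name. The directory names `images/` and `words/`, the `.txt` extension, and the helper name `pair_lines` are assumptions made for this example and are not taken from the paper; the actual layout of the released dataset may differ.

```python
from pathlib import Path

# Assumed image formats; adjust to whatever the dataset actually ships.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_lines(image_dir, text_dir):
    """Return (image_path, transcription) pairs matched by file stem.

    Hypothetical helper: assumes every line image `<name>.jpg` has its
    transcription stored in `<name>.txt` inside `text_dir`.
    """
    text_dir = Path(text_dir)
    pairs = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore anything that is not a line image
        text_path = text_dir / (image_path.stem + ".txt")
        if not text_path.is_file():
            continue  # skip images that have no transcription file
        transcription = text_path.read_text(encoding="utf-8").strip()
        pairs.append((image_path, transcription))
    return pairs
```

Under these assumptions, calling `pair_lines("images", "words")` on a local copy of the data would give a list that can be fed to a recognition model's data loader; images without a transcription are skipped rather than raising an error.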



Published In

HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing
September 2021
72 pages
ISBN:9781450386906
DOI:10.1145/3476887
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Digital Peter
  2. Russian
  3. handwritten text recognition
  4. historical dataset

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HIP '21

Acceptance Rates

Overall Acceptance Rate 52 of 90 submissions, 58%


Cited By

  • (2024) Text Reuse Detection in Handwritten Documents. Doklady Mathematics 108:S2 (S424-S433). https://doi.org/10.1134/S106456242370120X. Online publication date: 11-Mar-2024
  • (2024) U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts. Neural Computing and Applications 36:20 (11777-11789). https://doi.org/10.1007/s00521-023-09356-5. Online publication date: 1-Jul-2024
  • (2024) Bridging the Gap in Resource for Offline English Handwritten Text Recognition. Document Analysis and Recognition - ICDAR 2024 (413-428). https://doi.org/10.1007/978-3-031-70536-6_25. Online publication date: 3-Sep-2024
  • (2024) Handwritten Text Recognition and Browsing in Archive of Prisoners’ Letters from Smolensk Convict Prison. Analysis of Images, Social Networks and Texts (227-240). https://doi.org/10.1007/978-3-031-54534-4_16. Online publication date: 12-Mar-2024
  • (2023) Handwritten Paragraph Recognition Using Spatial Information on Russian Notebooks Dataset. 2023 34th Conference of Open Innovations Association (FRUCT) (108-113). https://doi.org/10.23919/FRUCT60429.2023.10328173. Online publication date: 15-Nov-2023
  • (2023) Line extraction in handwritten documents via instance segmentation. International Journal on Document Analysis and Recognition 26:3 (335-346). https://doi.org/10.1007/s10032-023-00438-7. Online publication date: 21-May-2023
  • (2022) Handwritten text generation and strikethrough characters augmentation. Computer Optics 46:3. https://doi.org/10.18287/2412-6179-CO-1049. Online publication date: Jun-2022
  • (2022) A survey of historical document image datasets. International Journal on Document Analysis and Recognition 25:4 (305-338). https://doi.org/10.1007/s10032-022-00405-8. Online publication date: 1-Dec-2022
