Digital Peter: New Dataset, Competition and Handwriting Recognition Methods

Authors: Denis Dimitrov, Alex Shonenkov, Vladimir Bataev, Denis Karachev, Maxim Novopoltsev, Andrey Chertok

HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing

Pages 43–48

https://doi.org/10.1145/3476887.3476892

Published: 31 October 2021

Abstract

This paper presents a new dataset of Peter the Great’s manuscripts and describes a segmentation procedure that converts the initial document images into lines. Researchers can use the dataset to train handwritten text recognition models and as a benchmark for comparing different models. It consists of 9694 images and text files, each corresponding to a line in a historical document. The open machine learning competition “Digital Peter” was held on this dataset. The article describes the baseline solution for the competition as well as more advanced handwritten text recognition methods. The full dataset and all code are publicly available.

Published In

HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing

September 2021

72 pages

ISBN: 9781450386906

DOI: 10.1145/3476887

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HIP '21

HIP '21: The 6th International Workshop on Historical Document Imaging and Processing

September 5 - 6, 2021

Lausanne, Switzerland

Acceptance Rates

Overall Acceptance Rate 52 of 90 submissions, 58%
