
Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Published: 10 May 2024

Abstract

Scholars in the humanities rely heavily on ancient manuscripts to study the history, religion, and socio-political structures of the past, and significant effort has been devoted to digitizing these precious manuscripts with OCR technology. However, most manuscripts have been blemished over the centuries, making it unrealistic for OCR programs to capture faded characters accurately. This work presents a Transformer + Confidence Score mechanism architecture for post-processing Google’s Tibetan OCR output. Measured by Loss and Character Error Rate, our Transformer + Confidence Score architecture outperforms the plain Transformer, LSTM-to-LSTM, and GRU-to-GRU architectures. Our method can be adapted to post-processing OCR output in any language.
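For readers unfamiliar with the evaluation metric named above: Character Error Rate (CER) is conventionally computed as the Levenshtein edit distance between the model output and the reference transcription, divided by the reference length. The sketch below is a minimal illustrative implementation of that standard definition, not the paper's own evaluation code.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(m, 1)


# One substitution in a 4-character reference -> CER of 0.25.
print(character_error_rate("abcd", "abed"))
```

The same definition applies directly to Tibetan text, since Python strings iterate over Unicode code points; a syllable- or grapheme-level variant would only change the tokenization step.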


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 5
    May 2024
    297 pages
    EISSN:2375-4702
    DOI:10.1145/3613584

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 May 2024
    Online AM: 30 March 2024
    Accepted: 17 March 2024
    Revised: 15 November 2023
    Received: 24 June 2023
    Published in TALLIP Volume 23, Issue 5


    Author Tags

    1. Post-processing OCR output
    2. Transformer
    3. neural networks

    Qualifiers

    • Short-paper
