
Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Published: 10 May 2024

Abstract

Scholars in the humanities rely heavily on ancient manuscripts to study the history, religion, and socio-political structures of the past, and significant effort has been devoted to digitizing these precious manuscripts with OCR technology. However, most manuscripts have been blemished over the centuries, making it unrealistic for OCR programs to capture faded characters accurately. This work presents a Transformer + Confidence Score mechanism architecture for post-processing Google’s Tibetan OCR output. Measured by Loss and Character Error Rate, our Transformer + Confidence Score architecture outperforms the plain Transformer, LSTM-to-LSTM, and GRU-to-GRU architectures. Our method can be adapted to post-processing OCR output in any language.
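For readers unfamiliar with the evaluation metric named above: Character Error Rate (CER) is conventionally computed as the Levenshtein edit distance between the model output and the reference transcription, divided by the reference length. The sketch below is a minimal illustrative implementation of that standard definition, not the paper's own evaluation code.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(m, 1)


# One substitution in a 4-character reference -> CER of 0.25.
print(character_error_rate("abcd", "abed"))
```

The same definition applies directly to Tibetan text, since Python strings iterate over Unicode code points; a syllable- or grapheme-level variant would only change the tokenization step.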


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 5
    May 2024
    297 pages
    EISSN:2375-4702
    DOI:10.1145/3613584

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 May 2024
    Online AM: 30 March 2024
    Accepted: 17 March 2024
    Revised: 15 November 2023
    Received: 24 June 2023
    Published in TALLIP Volume 23, Issue 5


    Author Tags

    1. Post-processing OCR output
    2. Transformer
    3. neural networks

    Qualifiers

    • Short-paper
