Confidence-Aware Document OCR Error Detection

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14994)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Optical Character Recognition (OCR) still produces recognition errors that degrade downstream applications. To help detect these errors, we explore the utility of OCR confidence scores for post-OCR error detection. We first analyze the correlation between confidence scores and error rates across different OCR systems. We then develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into its token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results show that integrating OCR confidence scores can improve error detection. This work underscores the value of OCR confidence scores for detection accuracy and reveals substantial performance disparities between commercial and open-source OCR technologies.
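As an illustration of the kind of confidence-versus-error analysis the abstract mentions, the following is a minimal sketch, not the authors' procedure. It assumes each OCR token comes with a confidence in [0, 1] and a boolean flag saying whether it disagrees with the ground truth; the function name, the equal-width binning, and the toy data are all hypothetical.

```python
from typing import List, Tuple


def error_rate_by_confidence_bin(
    confidences: List[float],
    is_error: List[bool],
    n_bins: int = 5,
) -> List[Tuple[float, float, int, float]]:
    """Group tokens into equal-width confidence bins.

    Returns one (bin_low, bin_high, token_count, error_rate) tuple
    for every bin that contains at least one token.
    """
    if len(confidences) != len(is_error):
        raise ValueError("confidences and is_error must have the same length")

    counts = [0] * n_bins
    errors = [0] * n_bins
    for conf, err in zip(confidences, is_error):
        # Clamp the index so that a confidence of exactly 1.0 lands in the last bin
        # and any slightly negative value lands in the first.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        counts[idx] += 1
        errors[idx] += int(err)

    rows = []
    for i in range(n_bins):
        if counts[i] > 0:
            rows.append((i / n_bins, (i + 1) / n_bins, counts[i], errors[i] / counts[i]))
    return rows


# Toy data (hypothetical): per-token OCR confidence and whether the token was wrong.
toy_confidences = [0.99, 0.95, 0.91, 0.72, 0.65, 0.45, 0.30, 0.15]
toy_is_error = [False, False, False, False, True, True, False, True]

rows = error_rate_by_confidence_bin(toy_confidences, toy_is_error)
for low, high, count, rate in rows:
    print(f"confidence [{low:.1f}, {high:.1f}): {count} tokens, error rate {rate:.2f}")
```

In this toy data the highest-confidence bin contains no errors, while lower bins are noisier. With real OCR output, how cleanly the error rate falls as confidence rises would indicate how useful the scores are as a detection signal for a given OCR system.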

Acknowledgments

This work was granted access to the HPC/AI resources of IDRIS under the allocation AD010614769 made by GENCI.

Author information

Authors and Affiliations

Shift Technology, Paris, France
Arthur Hemmer & Nicola Bartolo
L3i La Rochelle, La Rochelle, France
Arthur Hemmer, Mickaël Coustaty & Jean-Marc Ogier

Corresponding author

Correspondence to Arthur Hemmer.

Editor information

Editors and Affiliations

University of West Attica, Egaleo, Greece
Giorgos Sfikas
National Technical University of Athens, Zografou, Greece
George Retsinas

About this paper

Cite this paper

Hemmer, A., Coustaty, M., Bartolo, N., Ogier, JM. (2024). Confidence-Aware Document OCR Error Detection. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_13

DOI: https://doi.org/10.1007/978-3-031-70442-0_13
Published: 11 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70441-3
Online ISBN: 978-3-031-70442-0
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition
