Confidence-Aware Document OCR Error Detection

  • Conference paper
Document Analysis Systems (DAS 2024)

Abstract

Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
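As a rough illustration of the kind of confidence-versus-error analysis the abstract mentions, the sketch below groups tokens into equal-width confidence bins and reports the error rate in each bin. It is not the authors' procedure: the function name `error_rate_by_confidence_bin`, the binning scheme, and the toy data are all assumptions made for this example, and it presumes each OCR token comes with a confidence in [0, 1] and a ground-truth error flag.

```python
from typing import List, Tuple


def error_rate_by_confidence_bin(
    confidences: List[float],
    is_error: List[bool],
    n_bins: int = 5,
) -> List[Tuple[float, float, int, float]]:
    """Group tokens into equal-width confidence bins (hypothetical helper).

    Returns (bin_low, bin_high, token_count, error_rate) for each non-empty bin.
    Confidences are assumed to lie in [0, 1].
    """
    if len(confidences) != len(is_error):
        raise ValueError("confidences and is_error must have the same length")
    counts = [0] * n_bins
    errors = [0] * n_bins
    for conf, err in zip(confidences, is_error):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        errors[idx] += int(err)
    rows = []
    for i in range(n_bins):
        if counts[i]:  # skip empty bins to avoid dividing by zero
            rows.append((i / n_bins, (i + 1) / n_bins, counts[i], errors[i] / counts[i]))
    return rows


# Toy data invented purely for illustration; these are not results from the paper.
confs = [0.99, 0.95, 0.91, 0.72, 0.65, 0.40, 0.35, 0.10]
errs = [False, False, False, False, True, True, False, True]
for low, high, n, rate in error_rate_by_confidence_bin(confs, errs, n_bins=2):
    print(f"[{low:.1f}, {high:.1f}): n={n}, error rate={rate:.2f}")
```

If low-confidence bins show clearly higher error rates than high-confidence bins, the scores carry signal that an error detector could exploit. How informative they are in practice depends on the OCR system that produced them.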

Notes

  1. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-read?view=doc-intel-4.0.0

  2. https://aws.amazon.com/textract/

  3. https://cloud.google.com/use-cases/ocr?hl=en

  4. https://github.com/mindee/doctr

  5. https://github.com/JaidedAI/EasyOCR

  6. https://github.com/PaddlePaddle/PaddleOCR

Acknowledgments

This work was granted access to the HPC/AI resources of IDRIS under the allocation AD010614769 made by GENCI.

Author information

Corresponding author

Correspondence to Arthur Hemmer.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hemmer, A., Coustaty, M., Bartolo, N., Ogier, JM. (2024). Confidence-Aware Document OCR Error Detection. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_13

  • DOI: https://doi.org/10.1007/978-3-031-70442-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70441-3

  • Online ISBN: 978-3-031-70442-0

  • eBook Packages: Computer Science, Computer Science (R0)
