Unconstrained handwritten document retrieval

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task, and its inadequate performance is a major hurdle to providing a robust search experience over handwritten documents. In this paper, we describe our recent research on information retrieval from noisy text produced by imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of techniques for the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected (raw) OCR’ed text but modifies the standard vector space model to handle noisy text. The third method indexes documents with robust image features instead of noisy OCR’ed text. We describe these techniques in detail and report their performance using standard IR evaluation metrics.
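The abstract does not give the estimation formula, so the following is only a minimal sketch of how term frequencies might be estimated from uncertain recognizer output and used in a vector space model. It assumes each word hypothesis carries a segmentation confidence and a list of candidate transcriptions with posterior probabilities; the expected term frequency is then the sum of their products. The function names, the data layout, and the smoothed idf weighting are illustrative assumptions, not the authors' method.

```python
import math
from collections import defaultdict


def expected_term_frequencies(word_hypotheses):
    """Estimate term frequencies from noisy recognizer output.

    word_hypotheses: list of (seg_prob, candidates) pairs, where seg_prob is
    an ASSUMED confidence that the segment is a correctly segmented word and
    candidates maps each candidate transcription to an ASSUMED posterior.
    """
    tf = defaultdict(float)
    for seg_prob, candidates in word_hypotheses:
        for term, posterior in candidates.items():
            # Each candidate contributes a fractional count, not 0 or 1.
            tf[term] += seg_prob * posterior
    return dict(tf)


def rank_documents(query_terms, docs):
    """Rank documents by cosine similarity of tf-idf vectors (assumed weighting)."""
    tfs = {doc_id: expected_term_frequencies(h) for doc_id, h in docs.items()}

    # Document frequency: number of documents in which a term has nonzero mass.
    df = defaultdict(int)
    for tf in tfs.values():
        for term in tf:
            df[term] += 1

    n_docs = len(docs)

    def idf(term):
        # Smoothed idf, so unseen query terms do not divide by zero.
        return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0

    query_vec = {t: idf(t) for t in set(query_terms)}
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))

    scores = []
    for doc_id, tf in tfs.items():
        doc_vec = {t: f * idf(t) for t, f in tf.items()}
        doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        denom = query_norm * doc_norm
        scores.append((doc_id, dot / denom if denom > 0 else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy collection with made-up probabilities, for illustration only.
docs = {
    "d1": [(0.9, {"patient": 0.7, "patent": 0.3}),
           (1.0, {"fever": 0.9, "fewer": 0.1})],
    "d2": [(1.0, {"patent": 0.8, "patient": 0.2}),
           (1.0, {"office": 1.0})],
}
ranking = rank_documents(["patient", "fever"], docs)
```

In this toy collection, `d1` ranks above `d2` for the query because its expected counts for both query terms are higher, even though no single transcription is assumed to be correct.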

Author information

Corresponding author

Correspondence to Huaigu Cao.

Additional information

Dr. Cao’s work presented in this article was done during the fulfillment of his Ph.D. degree in the Department of Computer Science and Engineering, University at Buffalo.

About this article

Cite this article

Cao, H., Govindaraju, V. & Bhardwaj, A. Unconstrained handwritten document retrieval. IJDAR 14, 145–157 (2011). https://doi.org/10.1007/s10032-010-0139-z

  • DOI: https://doi.org/10.1007/s10032-010-0139-z
