Unconstrained handwritten document retrieval

Huaigu Cao¹,
Venu Govindaraju² &
Anurag Bhardwaj²

Abstract

With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task, and its inadequate performance is a major hurdle to providing a robust search experience over handwritten documents. In this paper, we describe our recent research on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of techniques for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected, raw OCR’ed text but modifies the standard vector space model to handle noisy text. The third method indexes the documents with robust image features instead of noisy OCR’ed text. We describe these techniques in detail and discuss their performance using standard IR evaluation metrics.
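To make the second family of methods concrete, the sketch below shows one simple way a vector space model could be adapted to raw recognizer output. It is an illustration under stated assumptions, not the formulation used in the paper: each word image is assumed to come with a list of candidate transcriptions and posterior probabilities, the hard term count is replaced by the sum of those posteriors (an expected term frequency), and documents are ranked by a length-normalized tf-idf score. The input format and the function names `expected_tf` and `rank` are hypothetical.

```python
import math
from collections import defaultdict


def expected_tf(doc):
    """Expected term frequencies for one document.

    `doc` is a list of word-image slots; each slot maps a candidate
    transcription to the recognizer's posterior probability.
    This input format is an assumption made for illustration.
    """
    tf = defaultdict(float)
    for slot in doc:
        for word, prob in slot.items():
            tf[word] += prob  # soft count instead of a hard 0/1 count
    return dict(tf)


def rank(query_terms, docs):
    """Return document indices ordered by a soft tf-idf score, best first."""
    tfs = [expected_tf(d) for d in docs]
    n = len(tfs)

    # Soft document frequency: each document contributes at most 1 per term.
    df = defaultdict(float)
    for tf in tfs:
        for word, value in tf.items():
            df[word] += min(1.0, value)

    def idf(word):
        return math.log(1.0 + n / (1.0 + df.get(word, 0.0)))

    scored = []
    for index, tf in enumerate(tfs):
        weights = {w: v * idf(w) for w, v in tf.items()}
        # Normalize by the document vector length, as in cosine similarity.
        norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
        score = sum(weights.get(q, 0.0) for q in query_terms) / norm
        scored.append((score, index))

    # Sort by descending score, breaking ties by document index.
    return [index for score, index in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Two toy documents with uncertain recognition results (made-up values).
docs = [
    [{"patient": 0.9, "patent": 0.1}, {"fever": 0.8, "fewer": 0.2}],
    [{"patent": 0.7, "patient": 0.3}, {"law": 1.0}],
]
ranking = rank(["patient"], docs)
```

In this toy setting, a document in which "patient" is only a low-probability alternative still receives a small score for the query instead of being missed outright, which is the motivation for keeping recognizer uncertainty rather than committing to the top transcription.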

Author information

Authors and Affiliations

Raytheon BBN Technologies, Cambridge, MA, 02138, USA
Huaigu Cao
Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, 14260, USA
Venu Govindaraju & Anurag Bhardwaj

Corresponding author

Correspondence to Huaigu Cao.

Additional information

Dr. Cao carried out the work presented in this article while completing his Ph.D. in the Department of Computer Science and Engineering, University at Buffalo.

About this article

Cite this article

Cao, H., Govindaraju, V. & Bhardwaj, A. Unconstrained handwritten document retrieval. IJDAR 14, 145–157 (2011). https://doi.org/10.1007/s10032-010-0139-z

Received: 23 December 2009
Revised: 29 July 2010
Accepted: 25 October 2010
Published: 16 November 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10032-010-0139-z
