Abstract
With the continued growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance, which is a major hurdle to providing a robust search experience over handwritten documents. In this paper, we describe our recent research on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of techniques for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected (raw) OCR’ed text but modifies the standard vector space model to handle noisy text. The third method employs robust image features to index the documents instead of using noisy OCR’ed text. We describe these techniques in detail and discuss their performance using standard IR evaluation metrics.
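To make the idea of retrieving from raw recognizer output in a vector space model more concrete, the following is a minimal sketch under stated assumptions, not the method developed in this paper. It assumes each word image comes with a small set of candidate transcriptions and posterior probabilities, and it replaces hard term counts with expected counts (the sum of posteriors). The function names, the smoothed IDF formula, and the toy posteriors are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def expected_term_frequencies(word_hypotheses):
    """Sum recognizer posteriors per term over the word images of one document.

    word_hypotheses: list of dicts, one per word image, mapping a candidate
    transcription to its (assumed) posterior probability.
    """
    tf = defaultdict(float)
    for candidates in word_hypotheses:
        for term, prob in candidates.items():
            tf[term] += prob
    return dict(tf)


def rank_documents(query_terms, documents):
    """Rank documents by a cosine-style TF-IDF score over expected counts.

    The query norm is constant across documents, so it is omitted.
    """
    tfs = {doc_id: expected_term_frequencies(hyps)
           for doc_id, hyps in documents.items()}
    n_docs = len(tfs)

    # Document frequency: a document counts if the term has any posterior mass.
    df = defaultdict(int)
    for tf in tfs.values():
        for term in tf:
            df[term] += 1

    def idf(term):
        # Smoothed IDF (an assumption; other weightings are equally valid).
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    scores = {}
    for doc_id, tf in tfs.items():
        norm = math.sqrt(sum((w * idf(t)) ** 2 for t, w in tf.items()))
        dot = sum(tf.get(t, 0.0) * idf(t) for t in query_terms)
        scores[doc_id] = dot / norm if norm > 0 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy data: each inner dict is one word image with hypothetical posteriors.
    documents = {
        "doc_a": [{"fever": 0.7, "fewer": 0.3}, {"cough": 0.9, "couch": 0.1}],
        "doc_b": [{"fewer": 0.8, "fever": 0.2}, {"chairs": 1.0}],
    }
    for doc_id, score in rank_documents(["fever"], documents):
        print(doc_id, round(score, 3))
```

In this toy setup, a document whose top-ranked transcription misses the query term can still receive a nonzero score, because lower-ranked candidates contribute fractional counts.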
Additional information
Dr. Cao’s work presented in this article was done during the fulfillment of his Ph.D. degree in the Department of Computer Science and Engineering, University at Buffalo.
Cite this article
Cao, H., Govindaraju, V. & Bhardwaj, A. Unconstrained handwritten document retrieval. IJDAR 14, 145–157 (2011). https://doi.org/10.1007/s10032-010-0139-z
DOI: https://doi.org/10.1007/s10032-010-0139-z