Abstract
Handwritten documents from domains such as cultural heritage, the judiciary, and modern journals remain largely unexplored. To a great extent, this is due to the lack of retrieval tools for such unlabeled document collections. This work considers such collections and presents a simple, robust retrieval framework for easy information access. We achieve retrieval on novel unlabeled collections through invariant features learned for handwritten text. These feature representations enable zero-shot retrieval for novel queries on unlabeled collections. We extend the framework further by supporting search via both text and exemplar queries. Four new collections written in English, Malayalam, and Bengali are used to evaluate our text retrieval framework. These collections comprise 2957 handwritten pages and over 300K words. We report promising results on these collections, despite the zero-shot constraint and the large collection size. Our framework allows new collections to be added without any collection-specific finetuning or labeling. Finally, we also present a demonstration of the retrieval framework. [Project Page].
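The retrieval idea summarized in the abstract can be illustrated with a small nearest-neighbour search over feature vectors. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that some encoder (not shown here) has already mapped every word image in a collection, and every text or exemplar query, into a common vector space, and it ranks by cosine similarity with a brute-force scan. The function names, word identifiers, and toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def build_index(word_embeddings: Dict[str, Vector]) -> Dict[str, List[float]]:
    """Store unit-length embeddings keyed by a (hypothetical) word-image id."""
    return {word_id: l2_normalize(v) for word_id, v in word_embeddings.items()}


def search(
    index: Dict[str, List[float]], query_embedding: Vector, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k word-image ids ranked by cosine similarity to the query.

    The query embedding may come from a text query or an exemplar image;
    this sketch assumes both live in the same space as the indexed vectors.
    """
    q = l2_normalize(query_embedding)
    scored = [
        (word_id, sum(a * b for a, b in zip(q, v))) for word_id, v in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings standing in for real learned features.
index = build_index(
    {
        "p1_w1": [1.0, 0.0, 0.0],
        "p1_w2": [0.0, 1.0, 0.0],
        "p2_w7": [0.9, 0.1, 0.0],
    }
)
results = search(index, [1.0, 0.0, 0.0], top_k=2)
for word_id, score in results:
    print(word_id, round(score, 4))
```

Because no labels from the target collection are used in this scheme, adding a new collection only requires embedding its word images and inserting them into the index. For collections with hundreds of thousands of words, the brute-force scan would typically be replaced by an approximate nearest-neighbour library.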
Notes
1. Demo links available at project page.
Acknowledgments
The authors acknowledge the funding support received through the IMPRINT project, Govt. of India.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gongidi, S., Jawahar, C.V. (2022). Handwritten Text Retrieval from Unlabeled Collections. In: Raman, B., Murala, S., Chowdhury, A., Dhall, A., Goyal, P. (eds) Computer Vision and Image Processing. CVIP 2021. Communications in Computer and Information Science, vol 1568. Springer, Cham. https://doi.org/10.1007/978-3-031-11349-9_1
DOI: https://doi.org/10.1007/978-3-031-11349-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11348-2
Online ISBN: 978-3-031-11349-9
eBook Packages: Computer Science, Computer Science (R0)