Abstract
Zero frequency is a fundamental problem in language-model-based information retrieval (IR), and smoothing is the standard remedy. Cluster-based smoothing has been found to be effective in this setting. Because its effectiveness depends on clustering quality, a better clustering algorithm offers scope for improvement. In this paper, we study how to improve cluster-based smoothing using a histogram-based incremental clustering algorithm and word embeddings. To our knowledge, this is the first study to integrate cluster-based smoothing with a language model to build an effective IR system for Bengali, one of the most widely spoken Indian languages. The proposed method was tested on two benchmark Bengali IR datasets. The experimental results show that the proposed model is effective for Bengali document retrieval and outperforms several baseline IR models.
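To make the zero-frequency problem concrete, the following minimal Python sketch scores a query against a document with and without cluster-based smoothing. It is an illustration under stated assumptions, not the method of this paper: the interpolation form (document model mixed with a cluster model and a collection model), the weights `lam` and `beta`, the toy English tokens, and the hand-assigned cluster are all hypothetical. The histogram-based incremental clustering and the word embeddings mentioned above are not modeled here.

```python
import math
from collections import Counter


def ml_prob(word, counts):
    """Maximum-likelihood estimate of P(word) from a bag of term counts."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def query_log_likelihood(query, doc, cluster, collection, lam=0.5, beta=0.5):
    """Log P(query | doc) with an assumed cluster-based interpolation:

    P(w|d) = lam * P_ml(w|d)
             + (1 - lam) * (beta * P_ml(w|cluster) + (1 - beta) * P_ml(w|collection))

    lam = 1.0 disables smoothing entirely.
    """
    doc_c, clu_c, col_c = Counter(doc), Counter(cluster), Counter(collection)
    score = 0.0
    for w in query:
        background = beta * ml_prob(w, clu_c) + (1 - beta) * ml_prob(w, col_c)
        p = lam * ml_prob(w, doc_c) + (1 - lam) * background
        if p == 0.0:
            # Zero frequency: one unseen query term zeroes the whole product.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy corpus; d1 and d2 are assumed to share a cluster.
d1 = ["river", "bank", "flood"]
d2 = ["river", "water", "boat"]
d3 = ["stock", "market", "bank"]
cluster_a = d1 + d2
collection = d1 + d2 + d3
query = ["water", "flood"]

# "water" never occurs in d1, so the unsmoothed score collapses.
unsmoothed = query_log_likelihood(query, d1, cluster_a, collection, lam=1.0)
# With smoothing, d1 borrows probability mass for "water" from its cluster.
smoothed = query_log_likelihood(query, d1, cluster_a, collection)
# d3 sits alone in its cluster and contains neither query term.
other = query_log_likelihood(query, d3, d3, collection)

print(unsmoothed, smoothed, other)
```

In this toy setup the unsmoothed score for `d1` is negative infinity because of a single missing term, while the smoothed score is finite and ranks `d1` above `d3`. This is the behavior that motivates smoothing with cluster-level statistics: the quality of the cluster assigned to a document directly affects how much useful probability mass it can borrow.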
About this article
Cite this article
Chatterjee, S., Sarkar, K. Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing. Sādhanā 48, 211 (2023). https://doi.org/10.1007/s12046-023-02258-1
DOI: https://doi.org/10.1007/s12046-023-02258-1