Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing

Abstract

Zero frequency is a fundamental problem in language-model-based information retrieval (IR): a query term that never occurs in a document receives zero probability under that document's model. Smoothing is applied to deal with this problem, and cluster-based smoothing has been found to be effective. Because its effectiveness depends on clustering quality, it can be improved by enhancing the clustering algorithm. In this paper, we study how to improve cluster-based smoothing using a histogram-based incremental clustering algorithm and word embeddings. To our knowledge, this is the first study to integrate cluster-based smoothing with a language model to build an effective IR system for Bengali, one of the most widely spoken Indian languages. The proposed method has been tested on two benchmark Bengali IR datasets. The experimental results show that the proposed model is effective for Bengali document retrieval and outperforms several baseline IR models.
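
To make the zero-frequency problem and the idea of cluster-based smoothing concrete, the following is a minimal sketch of a generic two-level interpolation: the document model is mixed with a background model that itself mixes the document's cluster and the whole collection. It is not the authors' implementation. The interpolation weights `lam` and `beta`, the function names, the toy English tokens, and the hand-assigned `cluster_of` mapping are all assumptions made for illustration; the histogram-based incremental clustering and the word embeddings described in the abstract are not reproduced here.

```python
import math
from collections import Counter


def ml_prob(word, counts):
    """Maximum-likelihood estimate: count(word) / total token count."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def cluster_smoothed_prob(word, doc, cluster, collection, lam=0.5, beta=0.5):
    """Interpolate document, cluster, and collection estimates.

    lam and beta are hypothetical weights, not values from the paper.
    """
    background = (beta * ml_prob(word, cluster)
                  + (1 - beta) * ml_prob(word, collection))
    return lam * ml_prob(word, doc) + (1 - lam) * background


def query_log_likelihood(query, doc, cluster, collection, lam=0.5, beta=0.5):
    """Sum of log probabilities of the query terms under the smoothed model."""
    score = 0.0
    for word in query:
        p = cluster_smoothed_prob(word, doc, cluster, collection, lam, beta)
        if p == 0.0:  # term unseen even in the collection
            return float("-inf")
        score += math.log(p)
    return score


# Toy corpus with placeholder tokens (assumed data, for illustration only).
docs = {
    "d1": ["river", "bank", "water"],
    "d2": ["river", "flood", "water"],
    "d3": ["bank", "loan", "money"],
}
# Hypothetical cluster assignment; a real system would compute this.
cluster_of = {"d1": ["d1", "d2"], "d2": ["d1", "d2"], "d3": ["d3"]}
collection = Counter(tok for toks in docs.values() for tok in toks)


def models_for(doc_id):
    """Return (document counts, cluster counts) for one document."""
    doc = Counter(docs[doc_id])
    cluster = Counter(tok for m in cluster_of[doc_id] for tok in docs[m])
    return doc, cluster


query = ["river", "flood"]
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, *models_for(d), collection),
    reverse=True,
)
print(ranking)
```

In this toy setup, "flood" never occurs in `d1`, so its unsmoothed probability there is zero and the whole query likelihood would collapse. Because `d1` shares a cluster with `d2`, the cluster term lends it some probability mass for "flood", which is the intuition behind letting clustering quality drive smoothing quality.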

Author information

Authors and Affiliations

Computer Science and Engineering Department, Jadavpur University, Kolkata, West Bengal, 700032, India
Soma Chatterjee & Kamal Sarkar

Corresponding author

Correspondence to Kamal Sarkar.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chatterjee, S., Sarkar, K. Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing. Sādhanā 48, 211 (2023). https://doi.org/10.1007/s12046-023-02258-1

Received: 12 April 2022
Revised: 16 May 2023
Accepted: 29 June 2023
Published: 04 October 2023
DOI: https://doi.org/10.1007/s12046-023-02258-1
