Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing

Sādhanā

Abstract

Zero frequency is a fundamental problem in language-model-based information retrieval: a query term that never occurs in a document receives zero probability under that document's model, and smoothing is applied to deal with this problem. Cluster-based smoothing has been found to be effective for retrieval with language models. Because its effectiveness depends on clustering quality, there is scope for improvement by enhancing the clustering algorithm. In this paper, we present a study on how to improve cluster-based smoothing using a histogram-based incremental clustering algorithm and word embeddings. To our knowledge, this is the first study to integrate cluster-based smoothing with a language model to develop an effective IR system for Bengali, one of the most widely spoken Indian languages. The proposed method has been tested on two benchmark Bengali IR datasets. The experimental results show that the proposed model is effective for Bengali document retrieval and outperforms several baseline IR models.
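To make the zero-frequency problem and the role of cluster-based smoothing concrete, here is a minimal sketch of query-likelihood scoring in which each document model is interpolated with the model of its cluster and of the whole collection. This is a common formulation of cluster-based smoothing and is not necessarily the exact estimator used in the paper. The toy documents, the cluster assignment, and the weights `lam` and `beta` are all assumptions made for illustration. The histogram-based incremental clustering and word-embedding steps are not reproduced here, so the clusters are simply given.

```python
import math
from collections import Counter


def model(tokens):
    """Return (term counts, total length) for a bag of tokens."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def ml_prob(word, counts, total):
    """Maximum-likelihood estimate; zero when the word is unseen."""
    return counts[word] / total if total else 0.0


def smoothed_prob(word, doc, cluster, coll, lam=0.7, beta=0.5):
    """Interpolate document, cluster, and collection estimates.

    lam and beta are illustrative values, not settings from the paper.
    """
    background = beta * ml_prob(word, *cluster) + (1 - beta) * ml_prob(word, *coll)
    return lam * ml_prob(word, *doc) + (1 - lam) * background


def query_log_likelihood(query, doc, cluster, coll):
    """Sum of log P(w | d); assumes every query word occurs in the collection."""
    return sum(math.log(smoothed_prob(w, doc, cluster, coll)) for w in query)


# Toy corpus (hypothetical); a real system would use stemmed Bengali tokens.
docs = {
    "d1": ["river", "boat", "river"],
    "d2": ["river", "fish", "water"],
    "d3": ["cricket", "match", "score"],
}
# Hypothetical cluster assignment: d1 and d2 share a cluster, d3 is alone.
cluster_of = {"d1": ["d1", "d2"], "d2": ["d1", "d2"], "d3": ["d3"]}

coll = model([t for toks in docs.values() for t in toks])
doc_models = {d: model(toks) for d, toks in docs.items()}
cluster_models = {
    d: model([t for member in members for t in docs[member]])
    for d, members in cluster_of.items()
}

query = ["river", "fish"]
scores = {
    d: query_log_likelihood(query, doc_models[d], cluster_models[d], coll)
    for d in docs
}
ranking = sorted(scores, key=scores.get, reverse=True)

# "fish" never occurs in d1, so the unsmoothed estimate is exactly zero.
unsmoothed_fish_d1 = ml_prob("fish", *doc_models["d1"])
smoothed_fish_d1 = smoothed_prob("fish", doc_models["d1"], cluster_models["d1"], coll)
print(ranking, unsmoothed_fish_d1, smoothed_fish_d1)
```

In this sketch, d1 does not contain "fish", yet it still receives a nonzero probability for that term because its cluster neighbour d2 does. That is why the quality of the clustering directly affects the quality of the smoothed estimates.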


Author information

Corresponding author

Correspondence to Kamal Sarkar.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chatterjee, S., Sarkar, K. Bengali document retrieval using a language modeling approach enhanced by improved cluster-based smoothing. Sādhanā 48, 211 (2023). https://doi.org/10.1007/s12046-023-02258-1

  • DOI: https://doi.org/10.1007/s12046-023-02258-1
