DOI: 10.1145/2063576.2063584
research-article

Lower-bounding term frequency normalization

Published: 24 October 2011

Abstract

In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
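
The lower-bounding idea can be sketched for Okapi BM25. The abstract does not give the formula, so the sketch below assumes the commonly cited BM25+ form, in which a constant `delta` is added to the length-normalized TF component of every matched query term. The function names, the parameter defaults (`k1 = 1.2`, `b = 0.75`, `delta = 1.0`), and the example document lengths are illustrative assumptions, not values taken from this page.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25 TF component with document-length normalization.

    As doc_len grows without bound, this value approaches 0 for any
    fixed tf, so a matched term in a very long document scores almost
    the same as an unmatched one.
    """
    if tf <= 0:
        return 0.0
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)


def bm25_plus_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75, delta=1.0):
    """Lower-bounded variant (assumed BM25+ form).

    A matched term (tf > 0) always contributes at least delta, no
    matter how long the document is; an unmatched term contributes 0.
    """
    if tf <= 0:
        return 0.0
    return bm25_tf(tf, doc_len, avg_doc_len, k1, b) + delta


if __name__ == "__main__":
    avg_len = 100.0  # hypothetical average document length
    for length in (100, 10_000, 1_000_000):
        plain = bm25_tf(1, length, avg_len)
        bounded = bm25_plus_tf(1, length, avg_len)
        print(f"doc_len={length:>9}: BM25 tf={plain:.6f}  lower-bounded tf={bounded:.6f}")
```

In a full scoring function each term's value would still be multiplied by an IDF weight and summed over the query terms; only the TF component changes, which is consistent with the abstract's claim of almost no additional computational cost.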



    Published In

    CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
    October 2011
    2712 pages
    ISBN:9781450307178
    DOI:10.1145/2063576

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. BM25+
    2. DIR+
    3. PL2+
    4. data analysis
    5. document length
    6. formal constraints
    7. lower bound
    8. term frequency

    Qualifiers

    • Research-article

    Conference

    CIKM '11
    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Article Metrics

    • Downloads (last 12 months): 58
    • Downloads (last 6 weeks): 4
    Reflects downloads up to 13 Feb 2025

    Cited By

    • (2024) An overview of the literature on assistance dogs using text mining and topic analysis. Frontiers in Veterinary Science 11. DOI: 10.3389/fvets.2024.1463332. Online publication date: 11-Dec-2024
    • (2024) RLocator: Reinforcement Learning for Bug Localization. IEEE Transactions on Software Engineering 50:10 (2695-2708). DOI: 10.1109/TSE.2024.3452595. Online publication date: 1-Oct-2024
    • (2024) Generative Expression Constrained Knowledge-Based Decoding for Open Data. The Semantic Web (307-325). DOI: 10.1007/978-3-031-60626-7_17. Online publication date: 19-May-2024
    • (2023) Performance Comparison of Passage Retrieval Models according to Korean Language Tokenization Methods. 2023 15th International Conference on Advanced Computational Intelligence (ICACI) (1-5). DOI: 10.1109/ICACI58115.2023.10146145. Online publication date: 6-May-2023
    • (2023) The IMPACT framework and implementation for accessible in silico clinical phenotyping in the digital era. npj Digital Medicine 6:1. DOI: 10.1038/s41746-023-00878-9. Online publication date: 21-Jul-2023
    • (2023) The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks. Multimedia Tools and Applications. DOI: 10.1007/s11042-023-16615-z. Online publication date: 8-Sep-2023
    • (2023) Answer Retrieval for Math Questions Using Structural and Dense Retrieval. Experimental IR Meets Multilinguality, Multimodality, and Interaction (209-223). DOI: 10.1007/978-3-031-42448-9_18. Online publication date: 11-Sep-2023
    • (2022) Readability of Graphical Contents on World Wide Web (WWW). 2022 17th Iberian Conference on Information Systems and Technologies (CISTI) (1-4). DOI: 10.23919/CISTI54924.2022.9820011. Online publication date: 22-Jun-2022
    • (2022) Leveraging Knowledge Graphs and Natural Language Processing for Automated Web Resource Labeling: Knowledge Mobilization in Neurodevelopmental Disorders (Preprint). Journal of Medical Internet Research. DOI: 10.2196/45268. Online publication date: 22-Dec-2022
    • (2022) Axiomatically Regularized Pre-training for Ad hoc Search. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (1524-1534). DOI: 10.1145/3477495.3531943. Online publication date: 6-Jul-2022
