short-paper

Verboseness Fission for BM25 Document Length Normalization

Authors:

Akiko Aizawa

ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval

Pages 385 - 388

https://doi.org/10.1145/2808194.2809486

Published: 27 September 2015

Abstract

BM25 is probably the best-known term weighting model in Information Retrieval. Depending on the formula variant, it has two or three parameters (k₁, b, and k₃). This paper addresses b, the document length normalization parameter. Previous work discusses two cases for length normalization: multi-topicality and verboseness. We observe that there are actually three: multi-topicality, verboseness with word repetition (repetitiveness), and verboseness with synonyms. Based on this observation, we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we obtain results statistically indistinguishable from the optimal results, which removes the need for ground-truth-based optimization.
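To make the role of b concrete, the following is a minimal Python sketch of the classic BM25 term weight, not of the normalization method proposed in this paper (which the abstract does not specify). The function name `bm25_term_weight`, its argument names, the default values k1=1.2 and b=0.75, and the smoothed idf variant are all assumptions chosen for illustration.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, df, n_docs,
                     k1=1.2, b=0.75, qtf=1, k3=None):
    """Classic BM25 weight of one query term in one document (illustrative).

    tf: term frequency in the document; qtf: term frequency in the query.
    df: number of documents containing the term; n_docs: collection size.
    """
    # Document length normalization: b = 0 ignores length entirely,
    # b = 1 scales fully by the document's length relative to the average.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)

    # Saturating term-frequency component controlled by k1.
    tf_part = tf * (k1 + 1.0) / (tf + k1 * norm)

    # Query term-frequency component, present only in the three-parameter
    # variant (k3). With k3=None the two-parameter variant is used.
    q_part = 1.0 if k3 is None else qtf * (k3 + 1.0) / (qtf + k3)

    # Smoothed idf that stays non-negative (an assumed, commonly used variant).
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

    return idf * tf_part * q_part


if __name__ == "__main__":
    # Same term statistics, different document lengths.
    for doc_len in (50, 100, 200):
        weight = bm25_term_weight(tf=3, doc_len=doc_len, avg_doc_len=100,
                                  df=10, n_docs=1000)
        print(doc_len, round(weight, 4))
```

With b > 0, a document longer than average receives a lower weight for the same term frequency; with b = 0 the weight does not depend on document length at all. Choosing b normally requires tuning against relevance judgments, which is the step the abstract says the proposed method makes unnecessary.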


Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Published In

ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval

September 2015, 402 pages

ISBN: 9781450338332

DOI: 10.1145/2808194

General Chairs: James Allan (University of Massachusetts Amherst, USA), Bruce Croft (University of Massachusetts Amherst, USA)

Program Chairs: Arjen de Vries (CWI Amsterdam, The Netherlands), Chengxiang Zhai (University of Illinois at Urbana-Champaign, USA)
Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICTIR '15: ACM SIGIR International Conference on the Theory of Information Retrieval

September 27 - 30, 2015

Northampton, Massachusetts, USA

Sponsor: SIGIR

Acceptance Rates

ICTIR '15 Paper Acceptance Rate 29 of 57 submissions, 51%;

Overall Acceptance Rate 235 of 527 submissions, 45%

Cited By

Askari A, Verberne S, Abolghasemi A, Kraaij W, Pasi G (2024). Retrieval for Extremely Long Queries and Documents with RPRS: A Highly Efficient and Effective Transformer-based Re-Ranker. ACM Transactions on Information Systems 42:5 (1-32). Online publication date: 29-Apr-2024. https://dl.acm.org/doi/10.1145/3631938

Zehtab G, Basiri A (2022). Employees Turnover Rate with Pivoted Length Normalization. 2022 27th International Computer Conference, Computer Society of Iran (CSICC) (1-4). Online publication date: 23-Feb-2022. https://doi.org/10.1109/CSICC55295.2022.9780489

Marchesin S, Di Nunzio G, Agosti M (2021). Simple but Effective Knowledge-Based Query Reformulations for Precision Medicine Retrieval. Information 12:10 (402). Online publication date: 29-Sep-2021. https://doi.org/10.3390/info12100402

Muntean C, Nardini F, Perego R, Tonellotto N, Frieder O (2020). Weighting Passages Enhances Accuracy. ACM Transactions on Information Systems 39:2 (1-11). Online publication date: 17-Dec-2020. https://dl.acm.org/doi/10.1145/3428687

Papariello L, Bampoulidis A, Lupu M (2020). On the Replicability of Combining Word Embeddings and Retrieval Models. Advances in Information Retrieval (50-57). Online publication date: 8-Apr-2020. https://doi.org/10.1007/978-3-030-45442-5_7

Lipani A (2019). On Biases in Information Retrieval Models and Evaluation. ACM SIGIR Forum 52:2 (172-173). Online publication date: 17-Jan-2019. https://dl.acm.org/doi/10.1145/3308774.3308804

Jian F, Huang J, Zhao J, Ying Z, Wang Y (2019). A topic‐based term frequency normalization framework to enhance probabilistic information retrieval. Computational Intelligence 36:2 (486-521). Online publication date: 20-Nov-2019. https://doi.org/10.1111/coin.12248

Lipani A, Roelleke T, Lupu M, Hanbury A (2018). A systematic approach to normalization in probabilistic models. Information Retrieval Journal 21:6 (565-596). Online publication date: 30-Jun-2018. https://doi.org/10.1007/s10791-018-9334-1

Yamada Y, Himeno Y, Nakatoh T (2018). Weighting of Noun Phrases Based on Local Frequency of Nouns. Recent Advances on Soft Computing and Data Mining (436-445). Online publication date: 12-Jan-2018. https://doi.org/10.1007/978-3-319-72550-5_42

Azzopardi J, Benedetti F, Guerra F, Lupu M (2017). Back to the Sketch-Board: Integrating Keyword Search, Semantics, and Information Retrieval. Semantic Keyword-Based Search on Structured Data Sources (49-61). Online publication date: 15-Feb-2017. https://doi.org/10.1007/978-3-319-53640-8_5
