DOI: 10.1145/1645953.1646237
poster

A machine learning approach for improved BM25 retrieval

Published: 02 November 2009

Abstract

Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.
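
For readers unfamiliar with the input attributes the abstract refers to (term frequency, document frequency, and field length), the following is a minimal sketch of one common BM25 formulation. It is not the authors' learned model. The function name `bm25_score`, the toy corpus, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Score one document (or one field) against a query with a common BM25 variant.

    k1 and b are the free parameters that normally require tuning;
    the defaults here are conventional values, not taken from the paper.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        df = doc_freq.get(term, 0)
        # Assumed IDF variant: the 1.0 inside the log keeps the weight non-negative.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalization: b=0 disables it, b=1 applies it fully.
        norm = 1.0 - b + b * doc_len / avg_len
        # Saturating term-frequency component, bounded above by (k1 + 1).
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score


# Hypothetical three-document corpus, tokenized on whitespace.
docs = [
    "bm25 ranking for web search".split(),
    "learning to rank with neural networks".split(),
    "cooking pasta at home".split(),
]
doc_freq = Counter(t for d in docs for t in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)
query = "bm25 ranking".split()
scores = [bm25_score(query, d, doc_freq, len(docs), avg_len) for d in docs]
```

Only the first document contains either query term, so it is the only one with a nonzero score. The same function can be applied per field (title, anchor text, and so on) by passing that field's tokens and statistics.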

      Published In

      CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
      November 2009
      2162 pages
      ISBN:9781605585123
      DOI:10.1145/1645953

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. bm25
      2. learning to rank
      3. retrieval models
      4. web search

      Qualifiers

      • Poster

      Conference

      CIKM '09

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
