Less is more: probabilistic models for retrieving fewer relevant documents

Authors:

David R. Karger

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 429 - 436

https://doi.org/10.1145/1148170.1148245

Published: 06 August 2006

Abstract

Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window at the top of the ranking. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user's information need. One example is when the user would be satisfied with some limited number of relevant documents, rather than needing all relevant documents. We show that in such a scenario, an attempt to return many relevant documents can actually reduce the chances of finding any relevant documents.
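The last claim can be illustrated with a small worked example. The sketch below is not taken from the paper: the two query interpretations, their probabilities, and the document names `d1`, `d2`, `d3` are all made up for illustration, and relevance is assumed to be fully determined by which interpretation the user has in mind.

```python
# Hypothetical toy model (an assumption, not the paper's data): the query has
# two interpretations, and each document is relevant under exactly one of them.
INTERPRETATIONS = {"A": 0.6, "B": 0.4}
RELEVANT_UNDER = {"d1": {"A"}, "d2": {"A"}, "d3": {"B"}}


def prob_relevant(doc):
    """Marginal probability that a single document is relevant."""
    return sum(p for i, p in INTERPRETATIONS.items() if i in RELEVANT_UNDER[doc])


def expected_relevant(docs):
    """Expected number of relevant documents in the returned set."""
    return sum(prob_relevant(d) for d in docs)


def prob_at_least_one(docs):
    """Probability that the returned set contains at least one relevant document."""
    return sum(
        p
        for i, p in INTERPRETATIONS.items()
        if any(i in RELEVANT_UNDER[d] for d in docs)
    )


# Probability ranking principle: sort by marginal probability of relevance.
prp_top2 = sorted(RELEVANT_UNDER, key=prob_relevant, reverse=True)[:2]
# A hand-picked alternative that covers both interpretations.
diverse_top2 = ["d1", "d3"]

for name, docs in [("PRP", prp_top2), ("diverse", diverse_top2)]:
    print(name, docs, expected_relevant(docs), prob_at_least_one(docs))
```

In this toy setting the probability-ranked pair has the higher expected number of relevant documents but the lower probability of containing any relevant document, because both of its documents depend on the same interpretation.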

We consider a number of information retrieval metrics from the literature, including the rank of the first relevant result, the %no metric that penalizes a system only for retrieving no relevant results near the top, and the diversity of retrieved results when queries have multiple interpretations. We observe that given a probabilistic model of relevance, it is appropriate to rank so as to directly optimize these metrics in expectation. While doing so may be computationally intractable, we show that a simple greedy optimization algorithm that approximately optimizes the given objectives produces rankings for TREC queries that outperform the standard approach based on the probability ranking principle.
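A greedy strategy of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the interpretation probabilities, the document names, and the function `greedy_rank` are hypothetical, the objective is the probability that the top k results contain at least one relevant document, and relevance is assumed to be determined by a hidden query interpretation.

```python
def greedy_rank(docs, interpretations, relevant_under, k):
    """Greedily pick k documents, each time adding the one that most increases
    the probability that the list contains at least one relevant document."""
    ranking, covered = [], set()
    remaining = list(docs)
    for _ in range(min(k, len(remaining))):
        def gain(doc):
            # Probability mass of interpretations this document newly covers.
            return sum(interpretations[i] for i in relevant_under[doc] - covered)

        best = max(remaining, key=gain)  # ties go to the earliest document
        ranking.append(best)
        covered |= relevant_under[best]
        remaining.remove(best)
    return ranking


def prob_at_least_one(ranking, interpretations, relevant_under):
    """Probability that at least one document in the ranking is relevant."""
    covered = set().union(*(relevant_under[d] for d in ranking))
    return sum(interpretations[i] for i in covered)


# Hypothetical ambiguous query with three interpretations (illustrative only).
interpretations = {"car": 0.5, "cat": 0.3, "os": 0.2}
relevant_under = {"c1": {"car"}, "c2": {"car"}, "a1": {"cat"}, "o1": {"os"}}
docs = list(relevant_under)


def marginal(doc):
    return sum(interpretations[i] for i in relevant_under[doc])


prp_top3 = sorted(docs, key=marginal, reverse=True)[:3]
greedy_top3 = greedy_rank(docs, interpretations, relevant_under, k=3)

print("PRP:   ", prp_top3, prob_at_least_one(prp_top3, interpretations, relevant_under))
print("greedy:", greedy_top3, prob_at_least_one(greedy_top3, interpretations, relevant_under))
```

Each greedy step only needs the gain of one candidate at a time, which is what keeps the procedure cheap compared with searching over all possible top-k lists.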


Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

August 2006

768 pages

ISBN: 1595933697

DOI: 10.1145/1148170

General Chair: Efthimis N. Efthimiadis, University of Washington

Program Chairs: Susan Dumais, Microsoft Research, Redmond; David Hawking, CSIRO ICT Centre, Canberra, Australia; Kalervo Järvelin, University of Tampere, Finland

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR06: The 29th Annual International SIGIR Conference

August 6 - 11, 2006

Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
