DOI: 10.1145/1148170.1148245

Less is more: probabilistic models for retrieving fewer relevant documents

Published: 06 August 2006

Abstract

Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window at the top of the ranking. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user's information need. One example is when the user would be satisfied with some limited number of relevant documents, rather than needing all relevant documents. We show that in such a scenario, an attempt to return many relevant documents can actually reduce the chances of finding any relevant documents.
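The last claim can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes a hypothetical query with two interpretations, A and B, and three hypothetical documents, each relevant under exactly one interpretation. All names and probabilities are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy model (an assumption, not the paper's data): the query
# has two interpretations, and each document is relevant under exactly one.
interpretation_prob = {"A": 0.6, "B": 0.4}
doc_interpretation = {"a1": "A", "a2": "A", "b1": "B"}


def prob_relevant(doc):
    """Marginal probability that a single document is relevant."""
    return interpretation_prob[doc_interpretation[doc]]


def expected_relevant(docs):
    """Expected number of relevant documents in the returned set."""
    return sum(prob_relevant(d) for d in docs)


def prob_at_least_one(docs):
    """Probability that the returned set contains any relevant document."""
    covered = {doc_interpretation[d] for d in docs}
    return sum(interpretation_prob[t] for t in covered)


# Probability ranking principle: sort by marginal probability of relevance.
prp_top2 = sorted(doc_interpretation, key=prob_relevant, reverse=True)[:2]

# The pair that maximizes the chance of at least one relevant document.
best_top2 = max(combinations(doc_interpretation, 2), key=prob_at_least_one)

print(prp_top2, expected_relevant(prp_top2), prob_at_least_one(prp_top2))
print(best_top2, expected_relevant(best_top2), prob_at_least_one(best_top2))
```

Under these assumptions, ranking by probability of relevance returns the two documents for interpretation A. That maximizes the expected number of relevant documents, but the user sees nothing relevant whenever B was the intended meaning. Returning one document per interpretation lowers the expected count and guarantees at least one relevant result.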
We consider a number of information retrieval metrics from the literature, including the rank of the first relevant result, the %no metric that penalizes a system only for retrieving no relevant results near the top, and the diversity of retrieved results when queries have multiple interpretations. We observe that given a probabilistic model of relevance, it is appropriate to rank so as to directly optimize these metrics in expectation. While doing so may be computationally intractable, we show that a simple greedy optimization algorithm that approximately optimizes the given objectives produces rankings for TREC queries that outperform the standard approach based on the probability ranking principle.
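A minimal sketch of one such greedy strategy follows, for the objective of retrieving at least one relevant document in the top k. It is an illustration under stated assumptions, not the authors' implementation: the relevance model here is a hypothetical mixture over query interpretations, with documents assumed independent given the interpretation, and the names `greedy_one_call`, `prp_ranking`, `prior`, and `rel` are invented for this example. At each step the sketch picks the document most likely to be relevant given that every document chosen so far was not relevant.

```python
from math import prod


def greedy_one_call(prior, rel, k):
    """Greedy ranking for 'at least one relevant document in the top k'.

    prior: {interpretation: P(interpretation)}           (hypothetical model)
    rel:   {doc: {interpretation: P(doc relevant | interpretation)}}
    """
    remaining = set(rel)
    ranking = []
    # weight[t] = P(t) * P(all documents chosen so far are non-relevant | t)
    weight = dict(prior)
    for _ in range(min(k, len(rel))):
        # The joint probability "d relevant and all earlier picks non-relevant"
        # is proportional to the conditional probability we want to maximize.
        best = max(
            sorted(remaining),
            key=lambda d: sum(weight[t] * rel[d].get(t, 0.0) for t in weight),
        )
        ranking.append(best)
        remaining.remove(best)
        for t in weight:
            weight[t] *= 1.0 - rel[best].get(t, 0.0)
    return ranking


def prp_ranking(prior, rel, k):
    """Baseline: rank by marginal probability of relevance."""
    def marginal(d):
        return sum(prior[t] * rel[d].get(t, 0.0) for t in prior)
    return sorted(sorted(rel), key=marginal, reverse=True)[:k]


def prob_at_least_one(prior, rel, docs):
    """P(at least one of docs is relevant) under the mixture model."""
    p_none = sum(
        prior[t] * prod(1.0 - rel[d].get(t, 0.0) for d in docs) for t in prior
    )
    return 1.0 - p_none


# Made-up example data: three documents for interpretation A, one for B.
prior = {"A": 0.6, "B": 0.4}
rel = {
    "a1": {"A": 0.9},
    "a2": {"A": 0.8},
    "a3": {"A": 0.7},
    "b1": {"B": 0.9},
}

prp = prp_ranking(prior, rel, 2)
greedy = greedy_one_call(prior, rel, 2)
print(prp, prob_at_least_one(prior, rel, prp))
print(greedy, prob_at_least_one(prior, rel, greedy))
```

With this made-up data the baseline returns a1 and a2, for a success probability of about 0.59, while the greedy ranking returns a1 and b1, for about 0.9. The greedy choice is only locally optimal at each step, which matches the abstract's description of an approximate optimization rather than an exact one.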

    Published In

    SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
    August 2006
    768 pages
    ISBN:1595933697
    DOI:10.1145/1148170

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. formal models
    2. information retrieval
    3. machine learning
    4. subtopic retrieval

    Qualifiers

    • Article

    Conference

    SIGIR06: The 29th Annual International SIGIR Conference
    August 6 - 11, 2006
    Seattle, Washington, USA

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
