research-article

Term necessity prediction

Authors:

Jamie Callan

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 259 - 268

https://doi.org/10.1145/1871437.1871474

Published: 26 October 2010

Abstract

The probability that a term appears in relevant documents, P(t | R), is a fundamental quantity in several probabilistic retrieval models; however, it is difficult to estimate without relevance judgments or a relevance model. We call this value term necessity because it measures the percentage of relevant documents retrieved by the term, that is, how necessary a term's occurrence is to document relevance. Prior research typically either set this probability to a constant or estimated it from the term's inverse document frequency; neither approach was very effective.
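As a minimal sketch of the definition above: when relevance judgments are available, term necessity can be computed directly as the fraction of a query's relevant documents that contain the term. The function name `term_necessity` and the toy documents are illustrative assumptions, not taken from the paper.

```python
def term_necessity(term, relevant_docs):
    """Estimate P(t | R): the fraction of relevant documents containing `term`.

    `relevant_docs` is a list of token lists, one per judged-relevant document.
    """
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    containing = sum(1 for doc in relevant_docs if term in set(doc))
    return containing / len(relevant_docs)


# Hypothetical toy data: three judged-relevant documents for one query.
relevant = [
    ["prognosis", "of", "breast", "cancer"],
    ["breast", "tumor", "survival", "outlook"],
    ["cancer", "survival", "rates", "breast"],
]
necessity = {t: term_necessity(t, relevant) for t in ["breast", "cancer", "prognosis"]}
```

In this toy example a query term such as "prognosis" has low necessity because relevant documents often express the same idea with other words ("survival", "outlook"), which is the kind of mismatch the abstract refers to.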

This paper identifies several factors that affect term necessity, such as a term's topic centrality, synonymy, and abstractness. It develops term- and query-dependent features for each factor that enable supervised learning of a predictive model of term necessity from training data. Experiments with two popular retrieval models and six standard datasets demonstrate that using predicted term necessity estimates as user term weights of the original query terms leads to significant improvements in retrieval accuracy.
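One way to see how a predicted P(t | R) could act as a term weight is the classic Robertson-Spärck Jones relevance weight from the Binary Independence Model, log(p(1-q) / (q(1-p))), where p = P(t | R) and q is the probability that the term occurs in a non-relevant document. The sketch below is only an illustration under stated assumptions, not the paper's exact weighting scheme: q is approximated by the collection document frequency ratio, and the function name, predicted values, and collection statistics are all hypothetical.

```python
import math


def rsj_weight(p_t_given_r, doc_freq, num_docs, eps=1e-6):
    """Robertson-Sparck Jones style term weight.

    p_t_given_r: predicted term necessity P(t | R)
    doc_freq / num_docs: used here as an assumed approximation of P(t | non-R)
    """
    # Clamp both probabilities away from 0 and 1 to keep the logs finite.
    p = min(max(p_t_given_r, eps), 1 - eps)
    q = min(max(doc_freq / num_docs, eps), 1 - eps)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)


# Hypothetical predictions for two equally rare query terms.
predicted = {"breast": 0.9, "prognosis": 0.3}
doc_freq = {"breast": 1000, "prognosis": 1000}
weights = {t: rsj_weight(predicted[t], doc_freq[t], 1_000_000) for t in predicted}
```

With p fixed at 0.5 the first log term vanishes and the weight reduces to an IDF-like quantity, which corresponds to the constant-probability baseline mentioned in the abstract; a term-specific prediction lets two terms with the same document frequency receive different weights.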

Index Terms

Term necessity prediction
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

October 2010

2036 pages

ISBN:9781450300995

DOI:10.1145/1871437

General Chair: Jimmy Huang (York University, Canada)

Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

CIKM '10

CIKM '10: International Conference on Information and Knowledge Management

October 26 - 30, 2010

Toronto, ON, Canada

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 43
Total Downloads: 503
Downloads (last 12 months): 12
Downloads (last 6 weeks): 1

Reflects downloads up to 16 Nov 2024