Fewer topics? A million topics? Both?! On topics subsets in test collections

Published: 01 February 2020

Abstract

When evaluating IR run effectiveness using a test collection, a key question is: which search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, all of which feature more than 50 topics, a setting that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than shown in past work. We also study topic subsets in the context of the power of statistical significance tests. We find that there is a trade-off in using such sets: significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest that we still lack methodologies that provide stability guarantees on topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
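The abstract's central comparison, whether a topic subset ranks runs as accurately as the full topic set, is typically quantified with a rank correlation between the run ordering induced by the subset and the ordering induced by all topics. The sketch below illustrates that idea on hypothetical per-topic scores using Kendall's tau (a common choice for this kind of comparison); the data, subset size, and helper function are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch, assuming a topics-by-runs matrix of per-topic effectiveness
# scores (e.g., AP). It compares the run ranking induced by a topic subset
# against the ranking induced by the full topic set using Kendall's tau.
# The data below are synthetic; real experiments would load TREC run scores.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=(149, 30))  # hypothetical: 149 topics x 30 runs


def subset_tau(scores: np.ndarray, topic_ids) -> float:
    """Kendall's tau between run orderings from a topic subset and the full set."""
    full_means = scores.mean(axis=0)                      # mean score per run, all topics
    subset_means = scores[list(topic_ids)].mean(axis=0)   # mean score per run, subset only
    tau, _ = kendalltau(full_means, subset_means)
    return tau


# Example: how well does a random 20-topic subset reproduce the full ranking?
subset = rng.choice(scores.shape[0], size=20, replace=False)
print(f"tau(random 20-topic subset vs. full set) = {subset_tau(scores, subset):.3f}")
```

In the same spirit, the best and worst subsets discussed in the abstract would be found by searching over many candidate subsets of a given size and recording the highest and lowest correlations, rather than relying on a single random draw.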

Cited By

  • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42(1), 1–26. https://doi.org/10.1145/3597201. Online publication date: 18-Aug-2023.
  • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960–1970. https://doi.org/10.1145/3583780.3614916. Online publication date: 21-Oct-2023.

        Published In

        Information Retrieval, Volume 23, Issue 1
        Feb 2020
        113 pages

        Publisher

        Kluwer Academic Publishers

        United States

        Publication History

        Published: 01 February 2020
        Accepted: 26 April 2019
        Received: 23 November 2017

        Author Tags

        1. Retrieval evaluation
        2. Few topics
        3. Statistical significance
        4. Topic clustering

        Qualifiers

        • Research-article
