Fewer topics? A million topics? Both?! On topics subsets in test collections

Published: 01 February 2020

Abstract

When evaluating IR run effectiveness using a test collection, a key question is: which search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, all of which feature more than 50 topics, a setting that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than shown in past work. We also study topic subsets in the context of the power of statistical significance tests. We find that there is a trade-off in using such sets: significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest that we still lack methodologies that provide stability guarantees on topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
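The abstract's central comparison, whether a topic subset ranks runs as accurately as the full topic set, is typically quantified with a rank correlation between the run ordering induced by the subset and the ordering induced by all topics. The sketch below illustrates that idea on hypothetical per-topic scores using Kendall's tau (a common choice for this kind of comparison); the data, subset size, and helper function are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch, assuming a topics-by-runs matrix of per-topic effectiveness
# scores (e.g., AP). It compares the run ranking induced by a topic subset
# against the ranking induced by the full topic set using Kendall's tau.
# The data below are synthetic; real experiments would load TREC run scores.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=(149, 30))  # hypothetical: 149 topics x 30 runs


def subset_tau(scores: np.ndarray, topic_ids) -> float:
    """Kendall's tau between run orderings from a topic subset and the full set."""
    full_means = scores.mean(axis=0)                      # mean score per run, all topics
    subset_means = scores[list(topic_ids)].mean(axis=0)   # mean score per run, subset only
    tau, _ = kendalltau(full_means, subset_means)
    return tau


# Example: how well does a random 20-topic subset reproduce the full ranking?
subset = rng.choice(scores.shape[0], size=20, replace=False)
print(f"tau(random 20-topic subset vs. full set) = {subset_tau(scores, subset):.3f}")
```

In the same spirit, the best and worst subsets discussed in the abstract would be found by searching over many candidate subsets of a given size and recording the highest and lowest correlations, rather than relying on a single random draw.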

Cited By

  • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42(1), 1–26. https://doi.org/10.1145/3597201. Online publication date: 18-Aug-2023.
  • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960–1970. https://doi.org/10.1145/3583780.3614916. Online publication date: 21-Oct-2023.

        Published In

        Information Retrieval, Volume 23, Issue 1
        Feb 2020
        113 pages

        Publisher

        Kluwer Academic Publishers

        United States

        Publication History

        Published: 01 February 2020
        Accepted: 26 April 2019
        Received: 23 November 2017

        Author Tags

        1. Retrieval evaluation
        2. Few topics
        3. Statistical significance
        4. Topic clustering

        Qualifiers

        • Research-article
