
Research article. DOI: 10.1145/1571941.1572022

Document selection methodologies for efficient and effective learning-to-rank

Published: 19 July 2009

Abstract

Learning-to-rank has attracted great attention in the IR community. Much thought and research have been devoted to query-document feature extraction and to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on selecting documents for learning-to-rank data sets, or on the effect of these choices on the efficiency and effectiveness of learning-to-rank algorithms.
In this paper, we employ a number of document selection methodologies widely used in the context of evaluation: depth-k pooling, sampling (infAP, statAP), active learning (MTC), and on-line heuristics (hedge). Certain of these methodologies, e.g. sampling and active learning, have been shown to lead to efficient and effective evaluation. We investigate whether they can also enable efficient and effective learning-to-rank, and we compare them with the document selection methodology used to create the LETOR datasets.
Further, the utilized methodologies differ in nature, and thus they construct training data sets with different properties, such as the proportion of relevant documents they contain or the similarity among those documents. We study how such properties affect the efficiency, effectiveness, and robustness of learning-to-rank collections.
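
As a concrete illustration of the simplest of these methodologies, the following is a minimal Python sketch of depth-k pooling. It is a sketch under our own assumptions (the function depth_k_pool and the toy runs are invented for this illustration), not the authors' code or the exact procedure used in the paper: the pool is the union of the top-k documents retrieved by each participating system, and the pooled documents are then judged and used as training data.

    # Minimal sketch of depth-k pooling (illustrative; not the paper's code).
    # The pool of documents to judge is the union of the top-k documents
    # from every system's ranked list for a query.
    def depth_k_pool(ranked_lists, k):
        pool = set()
        for ranking in ranked_lists.values():
            pool.update(ranking[:k])  # top-k documents of this run
        return pool

    if __name__ == "__main__":
        # Two toy runs over the same query, best document first.
        runs = {
            "bm25": ["d3", "d1", "d7", "d2"],
            "lm":   ["d1", "d5", "d3", "d9"],
        }
        print(sorted(depth_k_pool(runs, k=2)))  # -> ['d1', 'd3', 'd5']

The sampling (infAP, statAP) and active-learning (MTC) methodologies studied in the paper replace this exhaustive top-k union with statistical or adaptive document selection, which is what makes the resulting training sets differ in size and composition.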

References

[1]
J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm. In J. Callan, G. Cormack, C. Clarke, D. Hawking, and A. Smeaton, editors, Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 393--394. ACM Press, July 2003.
[2]
J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In S. Dumais, E. N. Efthimiadis, D. Hawking, and K. Jarvelin, editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541--548. ACM Press, August 2006.
[3]
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 89--96, New York, NY, USA, 2005. ACM.
[4]
C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 193--200. MIT Press, 2006.
[5]
B. Carterette, J. Allan, and R. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 268--275, 2006.
[6]
B. Carterette, V. Pavlu, E. Kanoulas, J. A. Aslam, and J. Allan. Evaluation over thousands of queries. In S.-H. Myaeng, D. W. Oard, F. Sebastiani, T.-S. Chua, and M.-K. Leong, editors, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 651--658. ACM Press, July 2008.
[7]
W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, Aug. 1998.
[8]
Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933--969, 2003.
[9]
D. Harman. Overview of the third Text REtrieval Conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1--19. U.S. Government Printing Office, Apr. 1995.
[10]
T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning (ICML), pages 377--384, 2005.
[11]
T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (KDD), pages 217--226, 2006.
[12]
K. S. Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Inf. Process. Manage., 36(6):779--808, 2000.
[13]
T.-Y. Liu, T. Qin, J. Xu, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval.
[14]
T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In SIGIR '07: Proceedings of the Learning to Rank workshop at the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007.
[15]
T. Minka and S. Robertson. Selection bias in the LETOR datasets. In SIGIR '08: Proceedings of the Learning to Rank workshop at the 31st annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2008. ACM.
[16]
V. Pavlu. Large Scale IR Evaluation. PhD thesis, Northeastern University, College of Computer and Information Science, 2008.
[17]
T. Qin, T.-Y. Liu, J. Xu, and H. Li. How to make LETOR more useful and reliable. In SIGIR '08: Proceedings of the Learning to Rank workshop at the 31st annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2008. ACM.
[18]
A. Singhal. Modern information retrieval: a brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35--43, 2001.
[19]
M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 585--593, New York, NY, USA, 2006. ACM.
[20]
E. M. Voorhees and D. Harman. Overview of the seventh text retrieval conference (TREC-7). In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 1--24, 1999.
[21]
E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In P. S. Yu, V. Tsotras, E. Fox, and B. Liu, editors, Proceedings of the Fifteenth ACM International Conference on Information and Knowledge Management, pages 102--111. ACM Press, November 2006.
[22]
C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179--214, 2004.
[23]
J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [7], pages 307--314.




    Published In

    SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
    July 2009
    896 pages
    ISBN:9781605584836
    DOI:10.1145/1571941
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 July 2009


    Author Tags

    1. document selection methodologies
    2. evaluation
    3. learning-to-rank
    4. sampling

    Qualifiers

    • Research-article

    Conference

    SIGIR '09

    Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 18
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 09 Dec 2024


    Citations

    Cited By

    • (2024) Meta Learning to Rank for Sparsely Supervised Queries. ACM Transactions on Information Systems. DOI: 10.1145/3698876. Online publication date: 8-Oct-2024.
    • (2023) On the Effect of Low-Ranked Documents: A New Sampling Function for Selective Gradient Boosting. Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, pages 646--652. DOI: 10.1145/3555776.3577597. Online publication date: 27-Mar-2023.
    • (2022) Stochastic Retrieval-Conditioned Reranking. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, pages 81--91. DOI: 10.1145/3539813.3545141. Online publication date: 23-Aug-2022.
    • (2022) Deep Bayesian Active Learning for Learning to Rank: A Case Study in Answer Selection. IEEE Transactions on Knowledge and Data Engineering, 34(11):5251--5262. DOI: 10.1109/TKDE.2021.3056894. Online publication date: 1-Nov-2022.
    • (2021) Getting Your Package to the Right Place: Supervised Machine Learning for Geolocation. Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, pages 403--419. DOI: 10.1007/978-3-030-86514-6_25. Online publication date: 10-Sep-2021.
    • (2021) Weakly Supervised Label Smoothing. Advances in Information Retrieval, pages 334--341. DOI: 10.1007/978-3-030-72240-1_33. Online publication date: 30-Mar-2021.
    • (2020) Research on Recommendation Method of Product Design Scheme Based on Multi-Way Tree and Learning-to-Rank. Machines, 8(2):30. DOI: 10.3390/machines8020030. Online publication date: 5-Jun-2020.
    • (2020) Sampling Query Variations for Learning to Rank to Improve Automatic Boolean Query Generation in Systematic Reviews. Proceedings of The Web Conference 2020, pages 3041--3048. DOI: 10.1145/3366423.3380075. Online publication date: 20-Apr-2020.
    • (2019) Reducing correlation of random forest-based learning-to-rank algorithms using subsample size. Computational Intelligence, 35(4):774--798. DOI: 10.1111/coin.12213. Online publication date: 29-Apr-2019.
    • (2019) RankBoost+: an improvement to RankBoost. Machine Learning. DOI: 10.1007/s10994-019-05826-x. Online publication date: 12-Aug-2019.
    • Show More Cited By
