Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3340531.3412000acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Examining the Additivity of Top-k Query Processing Innovations

Published: 19 October 2020 Publication History

Abstract

Research activity spanning more than five decades has led to index organizations, compression schemes, and traversal algorithms that allow extremely rapid response to ranked queries against very large text collections. However, little attention has been paid to the interactions between these many components, and the additivity of algorithmic improvements has not been explored. Here we examine the extent to which efficiency improvements add up. We employ four query processing algorithms, four compression codecs, and all possible combinations of four distinct further optimizations, and compare the performance of the 256 resulting systems to determine when and how different optimizations interact. Our results over two test collections show that efficiency enhancements are, for the most part, additive, and that there is little risk of negative interactions. In addition, our detailed profiling across this large pool of systems leads to key insights as to why the various individual enhancements work well, and indicates that optimizing "simpler" implementations can result in higher query throughput than is available from non-optimized versions of the more "complex" techniques, with clear implications for the choices needing to be made by practitioners.

Supplementary Material

MP4 File (3340531.3412000.mp4)
This overview talk accompanies the CIKM 2020 paper "Examining the Additivity of Top-k Query Processing Innovations" by Joel Mackenzie and Alistair Moffat. Please see the paper for more details.

References

[1]
M. Akcay, I. S. Altingovde, C. Macdonald, and I. Ounis. On the additivity and weak baselines for search result diversification research. In Proc. ICTIR, pages 109--116, 2017.
[2]
J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, B. Dachev, and E. Kanoulas. Million query track 2007 overview. In Proc. TREC, 2007.
[3]
J. Allan, J. A. Aslam, B. Carterette, V. Pavlu, and E. Kanoulas. Million query track 2008 overview. In Proc. TREC, 2008.
[4]
V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. SIGIR, pages 35--42, 2001.
[5]
T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Improvements that don't add up: Ad-hoc retrieval results since 1998. In Proc. CIKM, pages 601--610, 2009.
[6]
X. Bai, I. Arapakis, B. B. Cambazoglu, and A. Freire. Understanding and leveraging the impact of response latency on user behaviour in web search. ACM Trans. Inf. Sys., 36 (2): 1--42, 2017.
[7]
R. Blanco and A. Barreiro. TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr., 9 (4): 499--517, 2006.
[8]
A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. CIKM, pages 426--434, 2003.
[9]
E. W. Brown. Fast evaluation of structured queries for information retrieval. In Proc. SIGIR, pages 30--38, 1995.
[10]
B. Carterette, V. Pavlu, H. Fang, and E. Kanoulas. Million query track 2009 overview. In Proc. TREC, 2009.
[11]
C. L. A. Clarke, J. S. Culpepper, and A. Moffat. Assessing efficiency-effectiveness tradeoffs in multi-stage retrieval systems without using relevance judgments. Inf. Retr., 19 (4): 351--377, 2016.
[12]
M. Crane, A. Trotman, and R. O'Keefe. Maintaining discriminatory power in quantized indexes. In Proc. CIKM, pages 1221--1224, 2013.
[13]
M. Crane, J. S. Culpepper, J. Lin, J. Mackenzie, and A. Trotman. A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In Proc. WSDM, pages 201--210, 2017.
[14]
M. F. Dacrema, P. Cremonesi, and D. Jannach. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proc. RecSys, pages 101--109, 2019.
[15]
C. M. Daoud, E. S. de Moura, A. Carvalho, A. S. da Silva, D. Fernandes, and C. Rossi. Fast top-k preserving query processing using two-tier indexes. Inf. Proc. & Man., 52 (5): 855--872, 2016.
[16]
C. M. Daoud, E. S. de Moura, D. Fernandes, A. S. da Silva, C. Rossi, and A. Carvalho. Waves: A fast multi-tier top-k query processing algorithm. Inf. Retr., 20 (3): 292--316, 2017.
[17]
L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proc. KDD, pages 1535--1544, 2016.
[18]
S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. SIGIR, pages 993--1002, 2011.
[19]
S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proc. WWW, pages 311--320, 2010.
[20]
M. Fontoura, V. Josifovski, J. Liu, S. Venkatesan, X. Zhu, and J. Zien. Evaluation strategies for top-k queries over memory-resident inverted indexes. Proc. VLDB, 4 (12): 1213--1224, 2011.
[21]
A. Grand, R. Muir, J. Ferenczi, and J. Lin. From MaxScore to Block-Max Wand: The story of how Lucene significantly improved query evaluation performance. In Proc. ECIR, pages 20--27, 2020.
[22]
er and Hanbury(2019)]hh19-osirrcS. Hofst"atter and A. Hanbury. Let's measure run time! Extending the IR replicability infrastructure to include performance aspects. In Proc. OSIRRC at SIGIR 2019, pages 12--16, 2019.
[23]
C. Kamphuis, A. P. de Vries, L. Boytsov, and J. Lin. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In Proc. ECIR, pages 28--34, 2020.
[24]
A. Kane and F. W. Tompa. Split-lists and initial thresholds for WAND-based search. In Proc. SIGIR, pages 877--880, 2018.
[25]
S. Kharazmi, F. Scholer, D. Vallet, and M. Sanderson. Examining additivity and weak baselines. ACM Trans. Inf. Sys., 34 (4), 2016.
[26]
Y. Kim, J. Callan, J. S. Culpepper, and A. Moffat. Does selective search benefit from WAND optimization? In Proc. ECIR, pages 145--158, 2016.
[27]
D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Soft. Prac. & Exp., 41 (1): 1--29, 2015.
[28]
J. Lin and P. Yang. The impact of score ties on repeatability in document ranking. In Proc. SIGIR, pages 1125--1128, 2019.
[29]
J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, and S. Vigna. Toward reproducible baselines: The open-source IR reproducibility challenge. In Proc. ECIR, pages 408--420, 2016.
[30]
C. Macdonald and N. Tonellotto. Upper bound approximation for BlockMaxWand. In Proc. ICTIR, pages 273--276, 2017.
[31]
C. Macdonald, I. Ounis, and N. Tonellotto. Upper-bound approximations for dynamic pruning. ACM Trans. Inf. Sys., 29 (4): 17.1--17.28, 2011.
[32]
C. Macdonald, R. L. T. Santos, I. Ounis, and B. He. About learning models with multiple query-dependent features. ACM Trans. Inf. Sys., 31 (3): 11.1--11.39, 2013.
[33]
J. Mackenzie, A. Mallia, M. Petri, J. S. Culpepper, and T. Suel. Compressing inverted indexes with recursive graph bisection: A reproducibility study. In Proc. ECIR, pages 339--352, 2019.
[34]
J. Mackenzie, Z. Dai, L. Gallagher, and J. Callan. Efficiency implications of term weighting for passage retrieval. In Proc. SIGIR, pages 1821--1824, 2020.
[35]
A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster BlockMax WAND with variable-sized blocks. In Proc. SIGIR, pages 625--634, 2017.
[36]
Mallia, Siedlaczek, Mackenzie, and Suel]pisa19-osirrcA. Mallia, M. Siedlaczek, J. Mackenzie, and T. Suel. PISA: Performant indexes and search for academia. In Proc. OSIRRC at SIGIR 2019, pages 50--56, 2019 a.
[37]
Mallia, Siedlaczek, and Suel]mss19-ecirA. Mallia, M. Siedlaczek, and T. Suel. An experimental study of index compression and DAAT query processing methods. In Proc. ECIR, pages 353--368, 2019 b.
[38]
A. Mallia, M. Siedlaczek, M. Sun, and T. Suel. A comparison of top-k threshold estimation techniques for disjunctive query processing. In Proc. CIKM, 2020. To appear.
[39]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3 (1): 25--47, 2000.
[40]
S. Mohammed, M. Crane, and J. Lin. Quantization in append-only collections. In Proc. ICTIR, pages 265--268, 2017.
[41]
G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proc. SIGIR, pages 273--282, 2014.
[42]
M. Petri, J. S. Culpepper, and A. Moffat. Exploring the magic of WAND. In Proc. Aust. Doc. Comp. Symp., pages 58--65, 2013.
[43]
M. Petri, A. Moffat, J. Mackenzie, J. S. Culpepper, and D. Beck. Accelerated query processing via similarity score prediction. In Proc. SIGIR, pages 485--494, 2019.
[44]
G. E. Pibiri and R. Venturini. Techniques for inverted index compression, 2019. arXiv:1908.10598.
[45]
S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trnd. Inf. Retr., 3: 333--389, 2009.
[46]
E. Schurman and J. Brutlag. Performance related changes and their user impact. Velocity, 2009.
[47]
W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung. Inverted file compression through document identifier reassignment. Inf. Proc. & Man., 39 (1): 117--131, 2003.
[48]
F. Silvestri. Sorting out the document identifier assignment problem. In Proc. ECIR, pages 101--112, 2007.
[49]
F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. SIGIR, pages 305--312, 2004.
[50]
A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. SIMD-based decoding of posting lists. In Proc. CIKM, pages 317--326, 2011.
[51]
T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. SIGIR, pages 219--225, 2005.
[52]
N. Tonellotto, C. Macdonald, and I. Ounis. Efficient query processing for scalable web search. Found. Trnd. Inf. Retr., 12 (4--5): 319--500, 2018.
[53]
A. Trotman, X.-F. Jia, and M. Crane. Towards an efficient and effective search engine. In Proc. OSIR at SIGIR 2012, pages 40--47, 2012.
[54]
A. Trotman, A. Puurula, and B. Burgess. Improvements to BM25 and language models examined. In Proc. Aust. Doc. Comp. Symp., pages 58--65, 2014.
[55]
H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Proc. & Man., 31 (6): 831--850, 1995.
[56]
L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proc. SIGIR, pages 105--114, 2011.
[57]
Q. Wang and T. Suel. Document reordering for faster intersection. Proc. VLDB, 12 (5): 475--487, 2019.
[58]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.
[59]
E. Yafay and I. S. Altingovde. Caching scores for faster query processing with dynamic pruning in search engines. In Proc. CIKM, pages 2457--2460, 2019.
[60]
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, pages 401--410, 2009.
[61]
W. Yang, K. Lu, P. Yang, and J. Lin. Critically examining the “neural hype”: Weak baselines and the additivity of effectiveness gains from neural ranking models. In Proc. SIGIR, pages 1129--1132, 2019.
[62]
Z. Yang, A. Moffat, and A. Turpin. How precise does document scoring need to be? In Proc. Asia Info. Retri. Soc. Conf., pages 279--291, 2016.
[63]
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comp. Surv., 38 (2): 6:1--6:56, 2006.

Cited By

View all
  • (2024)Revisiting Document Expansion and Filtering for Effective First-Stage RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657850(186-196)Online publication date: 10-Jul-2024
  • (2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
  • (2023)Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse RepresentationsACM Transactions on Information Systems10.1145/357692241:4(1-28)Online publication date: 22-Mar-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN:9781450368599
DOI:10.1145/3340531
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. additivity
  2. dynamic pruning
  3. experimentation
  4. query processing

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council

Conference

CIKM '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)29
  • Downloads (Last 6 weeks)1
Reflects downloads up to 26 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Revisiting Document Expansion and Filtering for Effective First-Stage RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657850(186-196)Online publication date: 10-Jul-2024
  • (2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
  • (2023)Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse RepresentationsACM Transactions on Information Systems10.1145/357692241:4(1-28)Online publication date: 22-Mar-2023
  • (2023)Profiling and Visualizing Dynamic Pruning AlgorithmsProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591806(3125-3129)Online publication date: 19-Jul-2023
  • (2023)Efficient immediate-access dynamic indexingInformation Processing and Management: an International Journal10.1016/j.ipm.2022.10324860:3Online publication date: 24-May-2023
  • (2022)Beyond NEDProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498488(172-180)Online publication date: 11-Feb-2022
  • (2022)Tradeoff Options for Bipartite Graph PartitioningIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3208902(1-15)Online publication date: 2022
  • (2022)An NVM SSD-Based High Performance Query Processing Framework for Search EnginesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.316055735:6(5612-5625)Online publication date: 18-Mar-2022
  • (2022)Efficient query processing techniques for next-page retrievalInformation Retrieval10.1007/s10791-021-09402-725:1(27-43)Online publication date: 18-Jan-2022
  • (2021)Cost-Effective Updating of Distributed Reordered IndexesProceedings of the 25th Australasian Document Computing Symposium10.1145/3503516.3503528(1-8)Online publication date: 9-Dec-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media