research-article

Examining the Additivity of Top-k Query Processing Innovations

Authors:

Joel Mackenzie,

Alistair MoffatAuthors Info & Claims

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Pages 1085 - 1094

https://doi.org/10.1145/3340531.3412000

Published: 19 October 2020 Publication History

Abstract

Research activity spanning more than five decades has led to index organizations, compression schemes, and traversal algorithms that allow extremely rapid response to ranked queries against very large text collections. However, little attention has been paid to the interactions between these many components, and the additivity of algorithmic improvements has not been explored. Here we examine the extent to which efficiency improvements add up. We employ four query processing algorithms, four compression codecs, and all possible combinations of four distinct further optimizations, and compare the performance of the 256 resulting systems to determine when and how different optimizations interact. Our results over two test collections show that efficiency enhancements are, for the most part, additive, and that there is little risk of negative interactions. In addition, our detailed profiling across this large pool of systems leads to key insights as to why the various individual enhancements work well, and indicates that optimizing "simpler" implementations can result in higher query throughput than is available from non-optimized versions of the more "complex" techniques, with clear implications for the choices needing to be made by practitioners.

Supplementary Material

MP4 File (3340531.3412000.mp4)

This overview talk accompanies the CIKM 2020 paper "Examining the Additivity of Top-k Query Processing Innovations" by Joel Mackenzie and Alistair Moffat. Please see the paper for more details.

Download
23.60 MB

References

[1]

M. Akcay, I. S. Altingovde, C. Macdonald, and I. Ounis. On the additivity and weak baselines for search result diversification research. In Proc. ICTIR, pages 109--116, 2017.

Digital Library

[2]

J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, B. Dachev, and E. Kanoulas. Million query track 2007 overview. In Proc. TREC, 2007.

[3]

J. Allan, J. A. Aslam, B. Carterette, V. Pavlu, and E. Kanoulas. Million query track 2008 overview. In Proc. TREC, 2008.

[4]

V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. SIGIR, pages 35--42, 2001.

Digital Library

[5]

T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Improvements that don't add up: Ad-hoc retrieval results since 1998. In Proc. CIKM, pages 601--610, 2009.

Digital Library

[6]

X. Bai, I. Arapakis, B. B. Cambazoglu, and A. Freire. Understanding and leveraging the impact of response latency on user behaviour in web search. ACM Trans. Inf. Sys., 36 (2): 1--42, 2017.

Digital Library

[7]

R. Blanco and A. Barreiro. TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr., 9 (4): 499--517, 2006.

Digital Library

[8]

A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. CIKM, pages 426--434, 2003.

Digital Library

[9]

E. W. Brown. Fast evaluation of structured queries for information retrieval. In Proc. SIGIR, pages 30--38, 1995.

Digital Library

[10]

B. Carterette, V. Pavlu, H. Fang, and E. Kanoulas. Million query track 2009 overview. In Proc. TREC, 2009.

[11]

C. L. A. Clarke, J. S. Culpepper, and A. Moffat. Assessing efficiency-effectiveness tradeoffs in multi-stage retrieval systems without using relevance judgments. Inf. Retr., 19 (4): 351--377, 2016.

Digital Library

[12]

M. Crane, A. Trotman, and R. O'Keefe. Maintaining discriminatory power in quantized indexes. In Proc. CIKM, pages 1221--1224, 2013.

Digital Library

[13]

M. Crane, J. S. Culpepper, J. Lin, J. Mackenzie, and A. Trotman. A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In Proc. WSDM, pages 201--210, 2017.

Digital Library

[14]

M. F. Dacrema, P. Cremonesi, and D. Jannach. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proc. RecSys, pages 101--109, 2019.

Digital Library

[15]

C. M. Daoud, E. S. de Moura, A. Carvalho, A. S. da Silva, D. Fernandes, and C. Rossi. Fast top-k preserving query processing using two-tier indexes. Inf. Proc. & Man., 52 (5): 855--872, 2016.

Digital Library

[16]

C. M. Daoud, E. S. de Moura, D. Fernandes, A. S. da Silva, C. Rossi, and A. Carvalho. Waves: A fast multi-tier top-k query processing algorithm. Inf. Retr., 20 (3): 292--316, 2017.

Digital Library

[17]

L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proc. KDD, pages 1535--1544, 2016.

Digital Library

[18]

S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. SIGIR, pages 993--1002, 2011.

Digital Library

[19]

S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proc. WWW, pages 311--320, 2010.

Digital Library

[20]

M. Fontoura, V. Josifovski, J. Liu, S. Venkatesan, X. Zhu, and J. Zien. Evaluation strategies for top-k queries over memory-resident inverted indexes. Proc. VLDB, 4 (12): 1213--1224, 2011.

Digital Library

[21]

A. Grand, R. Muir, J. Ferenczi, and J. Lin. From MaxScore to Block-Max Wand: The story of how Lucene significantly improved query evaluation performance. In Proc. ECIR, pages 20--27, 2020.

Digital Library

[22]

er and Hanbury(2019)]hh19-osirrcS. Hofst"atter and A. Hanbury. Let's measure run time! Extending the IR replicability infrastructure to include performance aspects. In Proc. OSIRRC at SIGIR 2019, pages 12--16, 2019.

[23]

C. Kamphuis, A. P. de Vries, L. Boytsov, and J. Lin. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In Proc. ECIR, pages 28--34, 2020.

Digital Library

[24]

A. Kane and F. W. Tompa. Split-lists and initial thresholds for WAND-based search. In Proc. SIGIR, pages 877--880, 2018.

Digital Library

[25]

S. Kharazmi, F. Scholer, D. Vallet, and M. Sanderson. Examining additivity and weak baselines. ACM Trans. Inf. Sys., 34 (4), 2016.

Digital Library

[26]

Y. Kim, J. Callan, J. S. Culpepper, and A. Moffat. Does selective search benefit from WAND optimization? In Proc. ECIR, pages 145--158, 2016.

[27]

D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Soft. Prac. & Exp., 41 (1): 1--29, 2015.

Digital Library

[28]

J. Lin and P. Yang. The impact of score ties on repeatability in document ranking. In Proc. SIGIR, pages 1125--1128, 2019.

Digital Library

[29]

J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, and S. Vigna. Toward reproducible baselines: The open-source IR reproducibility challenge. In Proc. ECIR, pages 408--420, 2016.

[30]

C. Macdonald and N. Tonellotto. Upper bound approximation for BlockMaxWand. In Proc. ICTIR, pages 273--276, 2017.

Digital Library

[31]

C. Macdonald, I. Ounis, and N. Tonellotto. Upper-bound approximations for dynamic pruning. ACM Trans. Inf. Sys., 29 (4): 17.1--17.28, 2011.

[32]

C. Macdonald, R. L. T. Santos, I. Ounis, and B. He. About learning models with multiple query-dependent features. ACM Trans. Inf. Sys., 31 (3): 11.1--11.39, 2013.

[33]

J. Mackenzie, A. Mallia, M. Petri, J. S. Culpepper, and T. Suel. Compressing inverted indexes with recursive graph bisection: A reproducibility study. In Proc. ECIR, pages 339--352, 2019.

Digital Library

[34]

J. Mackenzie, Z. Dai, L. Gallagher, and J. Callan. Efficiency implications of term weighting for passage retrieval. In Proc. SIGIR, pages 1821--1824, 2020.

Digital Library

[35]

A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster BlockMax WAND with variable-sized blocks. In Proc. SIGIR, pages 625--634, 2017.

Digital Library

[36]

Mallia, Siedlaczek, Mackenzie, and Suel]pisa19-osirrcA. Mallia, M. Siedlaczek, J. Mackenzie, and T. Suel. PISA: Performant indexes and search for academia. In Proc. OSIRRC at SIGIR 2019, pages 50--56, 2019 a.

[37]

Mallia, Siedlaczek, and Suel]mss19-ecirA. Mallia, M. Siedlaczek, and T. Suel. An experimental study of index compression and DAAT query processing methods. In Proc. ECIR, pages 353--368, 2019 b.

Digital Library

[38]

A. Mallia, M. Siedlaczek, M. Sun, and T. Suel. A comparison of top-k threshold estimation techniques for disjunctive query processing. In Proc. CIKM, 2020. To appear.

Digital Library

[39]

A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3 (1): 25--47, 2000.

Digital Library

[40]

S. Mohammed, M. Crane, and J. Lin. Quantization in append-only collections. In Proc. ICTIR, pages 265--268, 2017.

Digital Library

[41]

G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proc. SIGIR, pages 273--282, 2014.

Digital Library

[42]

M. Petri, J. S. Culpepper, and A. Moffat. Exploring the magic of WAND. In Proc. Aust. Doc. Comp. Symp., pages 58--65, 2013.

Digital Library

[43]

M. Petri, A. Moffat, J. Mackenzie, J. S. Culpepper, and D. Beck. Accelerated query processing via similarity score prediction. In Proc. SIGIR, pages 485--494, 2019.

Digital Library

[44]

G. E. Pibiri and R. Venturini. Techniques for inverted index compression, 2019. arXiv:1908.10598.

[45]

S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trnd. Inf. Retr., 3: 333--389, 2009.

Digital Library

[46]

E. Schurman and J. Brutlag. Performance related changes and their user impact. Velocity, 2009.

[47]

W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung. Inverted file compression through document identifier reassignment. Inf. Proc. & Man., 39 (1): 117--131, 2003.

Digital Library

[48]

F. Silvestri. Sorting out the document identifier assignment problem. In Proc. ECIR, pages 101--112, 2007.

[49]

F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proc. SIGIR, pages 305--312, 2004.

Digital Library

[50]

A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. SIMD-based decoding of posting lists. In Proc. CIKM, pages 317--326, 2011.

Digital Library

[51]

T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. SIGIR, pages 219--225, 2005.

Digital Library

[52]

N. Tonellotto, C. Macdonald, and I. Ounis. Efficient query processing for scalable web search. Found. Trnd. Inf. Retr., 12 (4--5): 319--500, 2018.

Digital Library

[53]

A. Trotman, X.-F. Jia, and M. Crane. Towards an efficient and effective search engine. In Proc. OSIR at SIGIR 2012, pages 40--47, 2012.

[54]

A. Trotman, A. Puurula, and B. Burgess. Improvements to BM25 and language models examined. In Proc. Aust. Doc. Comp. Symp., pages 58--65, 2014.

Digital Library

[55]

H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Proc. & Man., 31 (6): 831--850, 1995.

Digital Library

[56]

L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proc. SIGIR, pages 105--114, 2011.

Digital Library

[57]

Q. Wang and T. Suel. Document reordering for faster intersection. Proc. VLDB, 12 (5): 475--487, 2019.

Digital Library

[58]

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.

Digital Library

[59]

E. Yafay and I. S. Altingovde. Caching scores for faster query processing with dynamic pruning in search engines. In Proc. CIKM, pages 2457--2460, 2019.

Digital Library

[60]

H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, pages 401--410, 2009.

Digital Library

[61]

W. Yang, K. Lu, P. Yang, and J. Lin. Critically examining the “neural hype”: Weak baselines and the additivity of effectiveness gains from neural ranking models. In Proc. SIGIR, pages 1129--1132, 2019.

[62]

Z. Yang, A. Moffat, and A. Turpin. How precise does document scoring need to be? In Proc. Asia Info. Retri. Soc. Conf., pages 279--291, 2016.

Digital Library

[63]

J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comp. Surv., 38 (2): 6:1--6:56, 2006.

Digital Library

Cited By

Mansour WZhuang SZuccon GMackenzie JHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Revisiting Document Expansion and Filtering for Effective First-Stage RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657850(186-196)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657850
Yigit-Sert SAltingovde IUlusoy Ö(2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
https://doi.org/10.1016/j.ipm.2024.103795
Mackenzie JTrotman ALin J(2023)Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse RepresentationsACM Transactions on Information Systems10.1145/357692241:4(1-28)Online publication date: 22-Mar-2023
https://dl.acm.org/doi/10.1145/3576922
Show More Cited By

Index Terms

Examining the Additivity of Top-k Query Processing Innovations
1. Information systems
  1. Information retrieval

Recommendations

Combining Joint and Semi-Join Operations for Distributed Query Processing

The application of a combination of join and semi-join operations to minimize the amount of data transmission required for distributed query processing is discussed. Specifically, two important concepts that occur with the use of join operations as ...
Interleaving a Join Sequence with Semijoins in Distributed Query Processing

The problem of combining join and semijoin reducers for distributed query processing is studied. An approach based on interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree first. The join ...
Approximate Query Processing with Error Guarantees
Big-Data-Analytics in Astronomy, Science, and Engineering
Abstract
In recent years, with the increase of data and the sophistication of analysis requirements, query processing in databases has become more important. Recently, approximate query processing (AQP) was proposed for efficiently executing database ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

October 2020

3619 pages

ISBN:9781450368599

DOI:10.1145/3340531

General Chairs:
Mathieu d'Aquin
DSI, Insight, NUI Galway, Ireland
,
Stefan Dietze
GESIS, Cologne, Germany, Heinrich-Heine-University Düsseldorf, Germany, L3S Research Center, Germany
,
Program Chairs:
Claudia Hauff
TU Delft, The Netherlands
,
Edward Curry
DSI, Insight, NUI Galway, Ireland
,
Philippe Cudre Mauroux
eXascale, University of Fribourg, Switzerland

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Australian Research Council

Conference

CIKM '20

Sponsor:

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management

October 19 - 23, 2020

Virtual Event, Ireland

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

13
Total Citations
View Citations
153
Total Downloads

Downloads (Last 12 months)29
Downloads (Last 6 weeks)1

Reflects downloads up to 26 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Mansour WZhuang SZuccon GMackenzie JHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Revisiting Document Expansion and Filtering for Effective First-Stage RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657850(186-196)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657850
Yigit-Sert SAltingovde IUlusoy Ö(2024)Diversity-aware strategies for static index pruningInformation Processing & Management10.1016/j.ipm.2024.10379561:5(103795)Online publication date: Sep-2024
https://doi.org/10.1016/j.ipm.2024.103795
Mackenzie JTrotman ALin J(2023)Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse RepresentationsACM Transactions on Information Systems10.1145/357692241:4(1-28)Online publication date: 22-Mar-2023
https://dl.acm.org/doi/10.1145/3576922
Li ZMackenzie JChen HDuh WHuang HKato MMothe JPoblete B(2023)Profiling and Visualizing Dynamic Pruning AlgorithmsProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591806(3125-3129)Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3539618.3591806
Moffat AMackenzie J(2023)Efficient immediate-access dynamic indexingInformation Processing and Management: an International Journal10.1016/j.ipm.2022.10324860:3Online publication date: 24-May-2023
https://dl.acm.org/doi/10.1016/j.ipm.2022.103248
Christmann PSaha Roy RWeikum GSelcuk Candan KLiu HAkoglu LLuna Dong XTang J(2022)Beyond NEDProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498488(172-180)Online publication date: 11-Feb-2022
https://dl.acm.org/doi/10.1145/3488560.3498488
Mackenzie JPetri MMoffat A(2022)Tradeoff Options for Bipartite Graph PartitioningIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3208902(1-15)Online publication date: 2022
https://doi.org/10.1109/TKDE.2022.3208902
Liu XPan YLi YWang GLiu X(2022)An NVM SSD-Based High Performance Query Processing Framework for Search EnginesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.316055735:6(5612-5625)Online publication date: 18-Mar-2022
https://dl.acm.org/doi/10.1109/TKDE.2022.3160557
Mackenzie JPetri MMoffat A(2022)Efficient query processing techniques for next-page retrievalInformation Retrieval10.1007/s10791-021-09402-725:1(27-43)Online publication date: 18-Jan-2022
https://dl.acm.org/doi/10.1007/s10791-021-09402-7
Mackenzie JMoffat A(2021)Cost-Effective Updating of Distributed Reordered IndexesProceedings of the 25th Australasian Document Computing Symposium10.1145/3503516.3503528(1-8)Online publication date: 9-Dec-2021
https://dl.acm.org/doi/10.1145/3503516.3503528
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents