Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Document reordering for faster intersection

Published: 01 January 2019 Publication History

Abstract

A lot of research has studied how to optimize inverted index structures in search engines through suitable reassignment of document identifiers. This approach was originally proposed to allow for better compression of the index, but subsequent work showed that it can also result in significant speed-ups for conjunctive queries and even certain types of disjunctive top-k algorithms. However, we do not have a good understanding of why this happens, and how we could directly optimize an index for query processing speed. As a result, existing techniques attempt to optimize for size, and treat speed increases as a welcome side-effect.
In this paper, we take an initial but important step towards understanding and modeling speed increases due to document reordering. We define the problem of minimizing the cost of queries given an inverted index and a query distribution, relate it to work on adaptive set intersection, and show that it is fundamentally different from that of minimizing compressed index size. We then propose a heuristic algorithm for finding a document reordering that minimizes query processing costs under suitable cost models. Our experiments show significant increases in the speed of intersections over state-of-the-art reordering techniques.

References

[1]
R. R. Amossen and R. Pagh. A new data layout for set intersection on gpus. In 2011 IEEE International Parallel Distributed Processing Symposium, pages 698--708, 2011.
[2]
N. Ao, F. Zhang, D. Wu, D. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. PVLDB, 4(8):470--481, 2011.
[3]
R. A. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Combinatorial Pattern Matching, pages 400--408, 2004.
[4]
R. A. Baeza-Yates. Experimental analysis of a fast intersection algorithm for sorted sequences. In String Processing and Information Retrieval, pages 13--24, 2005.
[5]
J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1):4:1--4:18, Mar. 2008.
[6]
J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Conference on Experimental Algorithms, pages 146--157, 2006.
[7]
J. Barbay, A. López-Ortiz, and T. Lu. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14:7:3.7--7:3.24, June 2010.
[8]
J. L. Bentley and A. C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5:82--87, 1976.
[9]
P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In International Symposium on Algorithms and Computation, pages 739--750, 2007.
[10]
R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. In Proceedings of the 29th European Conference on IR Research, pages 375--387, 2005.
[11]
D. Blandford and G. Blelloch. Index compression through document reordering. In Proceedings of the Data Compression Conference, page 342, 2002.
[12]
G. E. Blelloch and M. Reid-Miller. Fast set operations using treaps. In ACM Symposium on Parallel Algorithms and Architectures, pages 16--26, 1998.
[13]
A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 426--434, 2003.
[14]
M. R. Brown and R. E. Tarjan. A fast merging algorithm. J. ACM, 26:211--226, 1979.
[15]
M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In 2012 IEEE 28th International Conference on Data Engineering, pages 1360--1369, 2012.
[16]
K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In 2011 IEEE 27th International Conference on Data Engineering, pages 709--720, 2011.
[17]
J. Culpepper and A. Moffat. Efficient set intersection for inverted indexing. ACM Trans. Inf. Syst., 29(1):1:1--1:25, 2010.
[18]
E. D. Demaine, A. López-Ortiz, and J. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 743--752, 2000.
[19]
E. D. Demaine, A. López-Ortiz, and J. Munro. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation, pages 91--104, 2001.
[20]
L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1535--1544, 2016.
[21]
C. Dimopoulos, S. Nepomnyachiy, and T. Suel. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 113--122, 2013.
[22]
B. Ding and A. C. Konig. Fast set intersection in memory. PVLDB, 4(4):255--266, 2011.
[23]
S. Ding, J. Attenberg, and T. Suel. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web, pages 311--320, 2010.
[24]
S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 993--1002, 2011.
[25]
B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. Bitfunnel: Revisiting signatures for search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 605--614, 2017.
[26]
F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Inf., 1(2):145--158, June 1971.
[27]
F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly-ordered sets. SIAM J. Comput, 1(1):31--39, June 1972.
[28]
A. Kane and F. Tompa. Skewed partial bitvectors for list intersection. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 263--272, 2014.
[29]
A. Kane and F. W. Tompa. Distribution by document size. In Workshop on Large-scale and Distributed Systems for Information Retrieval, 2014.
[30]
S. Khot and O. Regev. Vertex cover might be hard to approximate to within 2-ϵ J. Comput. Syst. Sci., 74(3):335--349, May 2008.
[31]
D. Lemire, L. Boytsov, and N. Kurz. Simd compression and the intersection of sorted integers. Softw. Pract. Exper., 46(6):723--749, June 2016.
[32]
A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. Faster blockmax wand with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 625--634, 2017.
[33]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25--47, Jul 2000.
[34]
G. Ottaviano and R. Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 273--282, 2014.
[35]
V. Ramaswamy, R. Konow, A. Trotman, J. Degenhardt, and N. Whyte. Document reordering is good, especially for e-commerce. In Proceedings of the SIGIR 2017 Workshop On eCommerce, 2017.
[36]
C. Rosset, D. Jose, G. Ghosh, B. Mitra, and S. Tiwary. Optimizing query evaluations using reinforcement learning for web search. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1193--1196, 2018.
[37]
P. Sanders and F. Transier. Intersection in integer inverted indices. In Proceedings of the Meeting on Algorithm Engineering and Experiments, pages 71--83, 2007.
[38]
B. Schlegel, T. Dresden, T. Willhalm, and W. Lehner. Fast sorted-set intersection using simd instructions. In In International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage at VLDB, 2016.
[39]
W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Process. Management, 39(1):117--131, Jan. 2003.
[40]
F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research, pages 101--112, 2007.
[41]
F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 305--312, 2004.
[42]
F. Silvestri and R. Venturini. Vsencoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1219--1228, 2010.
[43]
S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On efficient posting list intersection with multicore processors. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 738--739, 2009.
[44]
D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. PVLDB, 2(1):838--849, 2009.
[45]
H. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Process. Management, 31(6):831--850, Nov. 1995.
[46]
L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 105--114, 2011.
[47]
D. Wu, F. Zhang, N. Ao, F. Wang, X. Liu, and G. Wang. A batched gpu algorithm for set intersection. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pages 752--756, 2009.
[48]
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web, pages 401--410, 2009.
[49]
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), July 2006.

Cited By

View all
  • (2024)Reordering Functions in Mobiles Apps for Reduced Size and Faster Start-UpACM Transactions on Embedded Computing Systems10.1145/366063523:4(1-54)Online publication date: 20-Apr-2024
  • (2023)Optimizing Function Layout for Mobile ApplicationsProceedings of the 24th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3589610.3596277(52-63)Online publication date: 13-Jun-2023
  • (2023)Faster Dynamic Pruning via Reordering of Documents in Inverted IndexesProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591987(2001-2005)Online publication date: 19-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 12, Issue 5
January 2019
163 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 January 2019
Published in PVLDB Volume 12, Issue 5

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)17
  • Downloads (Last 6 weeks)1
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Reordering Functions in Mobiles Apps for Reduced Size and Faster Start-UpACM Transactions on Embedded Computing Systems10.1145/366063523:4(1-54)Online publication date: 20-Apr-2024
  • (2023)Optimizing Function Layout for Mobile ApplicationsProceedings of the 24th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3589610.3596277(52-63)Online publication date: 13-Jun-2023
  • (2023)Faster Dynamic Pruning via Reordering of Documents in Inverted IndexesProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591987(2001-2005)Online publication date: 19-Jul-2023
  • (2022)Using Conjunctions for Faster Disjunctive Top-k QueriesProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining10.1145/3488560.3498489(917-927)Online publication date: 11-Feb-2022
  • (2021)Cost-Effective Updating of Distributed Reordered IndexesProceedings of the 25th Australasian Document Computing Symposium10.1145/3503516.3503528(1-8)Online publication date: 9-Dec-2021
  • (2021)Faster Index Reordering with Bipartite Graph PartitioningProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3404835.3462991(1910-1914)Online publication date: 11-Jul-2021
  • (2020)Examining the Additivity of Top-k Query Processing InnovationsProceedings of the 29th ACM International Conference on Information & Knowledge Management10.1145/3340531.3412000(1085-1094)Online publication date: 19-Oct-2020

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media