Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3035918.3064007acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

An Experimental Study of Bitmap Compression vs. Inverted List Compression

Published: 09 May 2017 Publication History

Abstract

Bitmap compression has been studied extensively in the database area and many efficient compression schemes were proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community and many inverted list compression algorithms were developed as well, e.g., VB, PforDelta, GroupVB, Simple8b, and SIMDPforDelta. We observe that they essentially solve the same problem, i.e., how to store a collection of sorted integers with as few as possible bits and support query processing as fast as possible. Due to historical reasons, bitmap compression and inverted list compression were developed as two separated lines of research in the database area and information retrieval area. Thus, a natural question is: Which one is better between bitmap compression and inverted list compression?
To answer the question, we present the first comprehensive experimental study to compare a series of 9 bitmap compression methods and 12 inverted list compression methods. We compare these 21 algorithms on synthetic datasets with different distributions (uniform, zipf, and markov) as well as 8 real-life datasets in terms of the space overhead, decompression time, intersection time, and union time. Based on the results, we provide many lessons and guidelines that can be used for practitioners to decide which technique to adopt in future systems and also for researchers to develop new algorithms.

References

[1]
D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671--682, 2006.
[2]
V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. IR, 8(1):151--166, 2005.
[3]
V. N. Anh and A. Moffat. Index compression using 64-bit words. SPE, 40(2):131--147, 2010.
[4]
G. Antoshenkov. Byte-aligned bitmap compression. In DCC, page 476, 1995.
[5]
M. Athanassoulis, Z. Yan, and S. Idreos. Upbit: Scalable in-memory updatable bitmap indexing. In SIGMOD, pages 1319--1332, 2016.
[6]
R. A. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri. Challenges on distributed web retrieval. In ICDE, pages 6--20, 2007.
[7]
L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22--28, 2003.
[8]
T. A. Bjørklund, N. Grimsmo, J. Gehrke, and O. Torbjørnsen. Inverted indexes vs. bitmap indexes in decision support systems. In CIKM, pages 1509--1512, 2009.
[9]
B. B. Cambazoglu and R. A. Baeza-Yates. Scalability and efficiency challenges in large-scale web search engines. In SIGIR, pages 1223--1226, 2016.
[10]
S. Chambi, D. Lemire, O. Kaser, and R. Godin. Better bitmap performance with roaring bitmaps. SPE, 46(5):709--719, 2016.
[11]
C. Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. In SIGMOD, pages 355--366, 1998.
[12]
C. Y. Chan and Y. E. Ioannidis. An efficient bitmap encoding scheme for selection queries. In SIGMOD, pages 215--226, 1999.
[13]
A. Colantonio and R. D. Pietro. Concise: Compressed 'n' composable integer set. IPL, 110(16):644--650, 2010.
[14]
J. S. Culpepper and A. Moffat. Efficient set intersection for inverted indexing. TOIS, 29(1):1--25, 2010.
[15]
D. R. Cutting and J. O. Pedersen. Optimizations for dynamic inverted index maintenance. In SIGIR, pages 405--411, 1990.
[16]
J. Dean. Challenges in building large-scale information retrieval systems: Invited talk. In WSDM, page 1, 2009.
[17]
F. Deliège and T. B. Pedersen. Position list word aligned hybrid: Optimizing space and performance for compressed bitmaps. In EDBT, pages 228--239, 2010.
[18]
H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall Press, 2008.
[19]
G. Guzun and G. Canahuate. Hybrid query optimization for hard-to-compress bit-vectors. VLDBJ, 25(3):339--354, 2016.
[20]
G. Guzun, G. Canahuate, D. Chiu, and J. Sawin. A tunable compression framework for bitmap indices. In ICDE, pages 484--495, 2014.
[21]
M. E. Haque, Y. H. Eom, Y. He, S. Elnikety, R. Bianchini, and K. S. McKinley. Few-to-many: Incremental parallelism for reducing tail latency in interactive services. In ASPLOS, pages 161--175, 2015.
[22]
A. S. Kesheng Wu, Ekow J. Otoo and H. Nordberg. Notes on design and implementation of compressed bit vectors, 2001.
[23]
S. Kim, J. Lee, S. R. Satti, and B. Moon. Sbh: Super byte-aligned hybrid bitmap compression. IS, 62:155--168, 2016.
[24]
A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandier, L. Doshi, and C. Bear. The vertica analytic database: C-store 7 years later. PVLDB, 5(12):1790--1801, 2012.
[25]
D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. SPE, 45(1):1--29, 2015.
[26]
D. Lemire, O. Kaser, and K. Aouiche. Sorting improves word-aligned bitmap indexes. DKE, 69(1):3--28, 2010.
[27]
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[28]
S. Mehrotra, S. Chauhan, and H. Bansal. Apache Hive Cookbook. Packt Publishing, 2016.
[29]
P. O'Neil, E. O'Neil, X. Chen, and S. Revilak. The star schema benchmark and augmented fact table indexing. 2009.
[30]
G. Ottaviano and R. Venturini. Partitioned elias-fano indexes. In SIGIR, pages 273--282, 2014.
[31]
V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In SIGMOD, pages 773--784, 2007.
[32]
K. Stockinger, J. Cieslewicz, K. Wu, D. Rotem, and A. Shoshani. Using bitmap index for joint queries on structured and text data. In New Trends in Data Warehousing and Data Analysis, pages 1--23. 2009.
[33]
S. Tatikonda, B. B. Cambazoglu, and F. P. Junqueira. Posting list intersection on multicore architectures. In SIGIR, pages 963--972, 2011.
[34]
L. Thiel and H. Heaps. Program design for retrospective searches on large data bases. IPM, 8(1):1--20, 1972.
[35]
S. Vigna. Quasi-succinct indices. In WSDM, pages 83--92, 2013.
[36]
J. Wang, E. Lo, M. L. Yiu, J. Tong, G. Wang, and X. Liu. The impact of solid state drive on search engine cache management. In SIGIR, pages 693--702, 2013.
[37]
J. Wang, E. Lo, M. L. Yiu, J. Tong, G. Wang, and X. Liu. Cache design of ssd-based search engine architectures: An experimental study. TOIS, 32(4):1--26, 2014.
[38]
K. Wu, E. J. Otoo, and A. Shoshani. On the performance of bitmap indices for high cardinality attributes. In VLDB, pages 24--35, 2004.
[39]
K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. TODS, 31(1):1--38, 2006.
[40]
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In WWW, pages 401--410, 2009.
[41]
J. Yun, Y. He, S. Elnikety, and S. Ren. Optimal aggregation policy for reducing tail latency of web search. In SIGIR, pages 63--72, 2015.
[42]
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In WWW, pages 387--396, 2008.
[43]
M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar ram-cpu cache compression. In ICDE, 2006.

Cited By

View all
  • (2024)Improving Graph Compression for Efficient Resource-Constrained Graph AnalyticsProceedings of the VLDB Endowment10.14778/3665844.366585217:9(2212-2226)Online publication date: 1-May-2024
  • (2024)RTScan: Efficient Scan with Ray Tracing CoresProceedings of the VLDB Endowment10.14778/3648160.364818317:6(1460-1472)Online publication date: 3-May-2024
  • (2024)FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point DataProceedings of the VLDB Endowment10.14778/3648160.364818017:6(1418-1431)Online publication date: 1-Feb-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 May 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. benchmarking
  2. bitmap compression
  3. evaluation
  4. inverted list compression
  5. main memory

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)78
  • Downloads (Last 6 weeks)6
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Improving Graph Compression for Efficient Resource-Constrained Graph AnalyticsProceedings of the VLDB Endowment10.14778/3665844.366585217:9(2212-2226)Online publication date: 1-May-2024
  • (2024)RTScan: Efficient Scan with Ray Tracing CoresProceedings of the VLDB Endowment10.14778/3648160.364818317:6(1460-1472)Online publication date: 3-May-2024
  • (2024)FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point DataProceedings of the VLDB Endowment10.14778/3648160.364818017:6(1418-1431)Online publication date: 1-Feb-2024
  • (2024)Revisiting B-tree Compression: An Experimental StudyProceedings of the ACM on Management of Data10.1145/36549722:3(1-25)Online publication date: 30-May-2024
  • (2024)Practical Rateless Set ReconciliationProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672219(595-612)Online publication date: 4-Aug-2024
  • (2024)CStream: Parallel Data Stream Compression on Multicore Edge DevicesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.338686236:11(5889-5904)Online publication date: Nov-2024
  • (2024)Data-Aware Adaptive Compression for Stream ProcessingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.337771036:9(4531-4549)Online publication date: Sep-2024
  • (2024)Vertex Encoding for Edge Nonexistence Determination With SIMD AccelerationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.335091936:7(3600-3614)Online publication date: Jul-2024
  • (2024)TMan: A High-Performance Trajectory Data Management System Based on Key-Value Stores2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00376(4951-4964)Online publication date: 13-May-2024
  • (2023)An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse VectorsACM Transactions on Information Systems10.1145/360979742:2(1-43)Online publication date: 8-Nov-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media