Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3132847.3132888acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

ANS-Based Index Compression

Published: 06 November 2017 Publication History

Abstract

Techniques for effectively representing the postings lists associated with inverted indexes have been studied for many years. Here we combine the recently developed "asymmetric numeral systems" (ANS) approach to entropy coding and a range of previous index compression methods, including VByte, Simple, and Packed. The ANS mechanism allows each of them to provide markedly improved compression effectiveness, at the cost of slower decoding rates. Using the 426 GB GOV2 collection, we show that the combination of blocking and ANS-based entropy-coding against a set of 16 magnitude-based probability models yields compression effectiveness superior to most previous mechanisms, while still providing reasonable decoding speed.

References

[1]
V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8 (1): 151--166, 2005.
[2]
V. N. Anh and A. Moffat. Index compression using 64-bit words. Soft. Prac. & Exp., 40 (2): 131--147, 2010.
[3]
, Navarro, and Esteller}bfne03spireN. R. Brisaboa, A. Fari na, G. Navarro, and M. F. Esteller. $(S, C)$-Dense coding: An optimized compression code for natural language text databases. In Proc. SPIRE, pages 122--136, 2003.
[4]
, Ladra, and Navarro}bfln08sigirN. R. Brisaboa, A. Fari na, S. Ladra, and G. Navarro. Reorganizing compressed text. In Proc. SIGIR, pages 139--146, 2008.
[5]
Y. Choueka, A. S. Fraenkel, and S. T. Klein. Compression of concordances in full-text retrieval systems. In Proc. SIGIR, pages 597--612, 1988.
[6]
J. S. Culpepper and A. Moffat. Enhanced byte codes with restricted prefix properties. In Proc. SPIRE, pages 1--12, 2005.
[7]
C. Dimopoulos, S. Nepomnyachiy, and T. Suel. Optimizing top-k document retrieval strategies for block-max indexes. In Proc. WSDM, pages 113--122, 2013.
[8]
S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. SIGIR, pages 993--1002, 2011.
[9]
J. Duda. Asymmetric numeral systems. CoRR, abs/0902.0271, 2009. URL http://arxiv.org/abs/0902.0271.
[10]
J. Duda. Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. CoRR, abs/1311.2540, 2013. URL http://arxiv.org/abs/1311.2540.
[11]
A. S. Fraenkel and S. T. Klein. Novel compression of sparse bit-strings: Preliminary report. In Combinatorial Algorithms on Words, Volume 12, NATO ASI Series F, pages 169--183. Springer, 1985.
[12]
F. Giesen. Interleaved entropy coders. CoRR, abs/1402.3392, 2014.
[13]
D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Soft. Prac. & Exp., 45 (1): 1--29, 2015.
[14]
A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Inf. Retr., 3 (1): 25--47, 2000.
[15]
A. Moffat and A. Turpin. Compression and Coding Algorithms. Kluwer, Boston, 2002.
[16]
A. Moffat and J. Zobel. Parameterised compression for sparse bitmaps. In Proc. SIGIR, pages 274--285, 1992.
[17]
G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proc. SIGIR, pages 273--282, 2014.
[18]
G. Ottaviano, N. Tonellotto, and R. Venturini. Optimal space-time tradeoffs for inverted indexes. In Proc. WSDM, pages 47--56, 2015.
[19]
A. Trotman. Compressing inverted files. Inf. Retr., 6 (1): 5--19, 2003.
[20]
A. Trotman. Compression, SIMD, and postings lists. In Proc. Aust. Doc. Comp. Symp., page 50, 2014.
[21]
A. Trotman, M. Albert, and B. Burgess. Optimal packing in Simple-family codecs. In Proc. ICTIR, pages 337--340, 2015.
[22]
J. Wang, C. Lin, R. He, M. Chae, Y. Papakonstantinou, and S. Swanson. MILC: Inverted list compression in memory. PVLDB, 10 (8): 853--864, 2017.
[23]
H. E. Williams and J. Zobel. Compressing integers for fast file access. Comp. J., 42 (3): 193--201, 1999.
[24]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
[25]
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, pages 401--410, 2009.
[26]
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, pages 387--396, 2008.
[27]
Z. Zhang, J. Tong, H. Huang, J. Liang, T. Li, R. J. Stones, G. Wang, and X. Liu. Leveraging context-free grammar for efficient inverted index compression. In Proc. SIGIR, pages 275--284, 2016.
[28]
an, Nes, and Boncz}zhnb06icdeM. Zukowski, S. Héman, N. Nes, and P. A. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, page 59, 2006.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. asymmetric numeral systems
  2. entropy coder
  3. index compression
  4. inverted index
  5. postings list

Qualifiers

  • Research-article

Funding Sources

Conference

CIKM '17
Sponsor:

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)1
Reflects downloads up to 26 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)An Asymmetric Encoding - Decoding Scheme for Lossless Data Compression2024 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT57864.2024.10619702(55-60)Online publication date: 7-Jul-2024
  • (2023)Rescaling of Symbol Counts for Adaptive rANS Coding2023 31st European Signal Processing Conference (EUSIPCO)10.23919/EUSIPCO58844.2023.10289961(585-589)Online publication date: 4-Sep-2023
  • (2023)A Novel ANS Coding with Low Computational Complexity2023 IEEE/CIC International Conference on Communications in China (ICCC Workshops)10.1109/ICCCWorkshops57813.2023.10233773(1-6)Online publication date: 10-Aug-2023
  • (2022)A Review of the Asymmetric Numeral System and Its Applications to Digital ImagesEntropy10.3390/e2403037524:3(375)Online publication date: 7-Mar-2022
  • (2022)Improved structures to solve aggregated queries for trips over public transportation networksInformation Sciences: an International Journal10.1016/j.ins.2021.10.079584:C(752-783)Online publication date: 1-Jan-2022
  • (2021)Redundancy and Optimization of tANS Entropy EncodersIEEE Transactions on Multimedia10.1109/TMM.2020.304054723(4341-4350)Online publication date: 2021
  • (2020)Techniques for Inverted Index CompressionACM Computing Surveys10.1145/341514853:6(1-36)Online publication date: 6-Dec-2020
  • (2019)Huffman CodingACM Computing Surveys10.1145/334255552:4(1-35)Online publication date: 30-Aug-2019
  • (2019)Fast Dictionary-Based Compression for Inverted IndexesProceedings of the Twelfth ACM International Conference on Web Search and Data Mining10.1145/3289600.3290962(6-14)Online publication date: 30-Jan-2019
  • (2019)Fast Construction of Almost Optimal Symbol Distributions for Asymmetric Numeral Systems2019 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT.2019.8849430(1682-1686)Online publication date: Jul-2019
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media