research-article

Open access

ALP: Adaptive Lossless floating-Point Compression

Authors:

Leonardo X. Kuffo,

Peter BonczAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 230, Pages 1 - 26

https://doi.org/10.1145/3626717

Published: 12 December 2023 Publication History

Abstract

IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are not better than those of general-purpose compressors such as Zstd; while [de]compression is much slower than Delta and FOR.

We propose and evaluate ALP, that significantly improves these previous schemes in both speed and compression ratio (Figure 1). We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key for also improving the compression ratio, as we found in-vector commonalities to create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals [31] to losslessly encode doubles as integers if they originated as decimals, and otherwise uses vectorized compression of the doubles' front bits. Its high speeds stem from our implementation in scalar code that auto-vectorizes, using building blocks provided by our FastLanes library [6], and an efficient two-stage compression algorithm that first samples row-groups and then vectors.

References

[1]

2019. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754--2019 (Revision of IEEE 754--2008) (2019), 1--84. https://doi.org/10.1109/IEEESTD.2019.8766229

[2]

2019. Public BI Benchmark. https://github.com/cwida/public_bi_benchmark. Accessed on: 2023-04--13.

[3]

2023. FastLanes. https://github.com/cwida/FastLanes Accesed on: 2023-04--13.

[4]

Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 671--682.

Digital Library

[5]

Azim Afroozeh and P Boncz. 2020. Towards a New File Format for Big Data: SIMD-Friendly Composable Compression.

[6]

Azim Afroozeh and Peter Boncz. 2023. The FastLanes Compression Layout: Decoding > 100 Billion Integers per Second with Scalar Code. Proc. VLDB Endow. 16, 9 (jul 2023), 2132--2144. https://doi.org/10.14778/3598581.3598587

Digital Library

[7]

Peter A Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In Cidr, Vol. 5. 225--237.

[8]

Boudewijn Braams. 2018. Predicate Pushdown in Parquet and Apache Spark. MSc thesis (2018).

[9]

Andrea Bruno, Franco Maria Nardini, Giulio Ermanno Pibiri, Roberto Trani, and Rossano Venturini. 2021. TSXor: A Simple Time Series Compression Algorithm. In String Processing and Information Retrieval: 28th International Symposium, SPIRE 2021, Lille, France, October 4--6, 2021, Proceedings 28. Springer, 217--223.

[10]

Martin Burtscher and Paruj Ratanaworabhan. 2008. FPC: A high-speed compressor for double-precision floating-point data. IEEE transactions on computers 58, 1 (2008), 18--31.

[11]

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging Properties in Self-Supervised Vision Transformers. CoRR abs/2104.14294 (2021). arXiv:2104.14294 https://arxiv.org/abs/2104.14294

[12]

Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey, Hector Gonzalez, David Lomax, Sagar Mittal, et al. 2019. Procella: Unifying serving and analytical data at YouTube.(2019).

Digital Library

[13]

Yann Collet. 2014. LZ4 - Extremely fast compression. https://github.com/lz4/lz4 Accesed on: 2023-04--13.

[14]

Yann Collet. 2015. Zstandard - Fast real-time compression algorithm. https://github.com/facebook/zstd Accesed on: 2023-04--13.

[15]

Patrick Damme, Dirk Habich, Juliana Hildebrandt, and Wolfgang Lehner. 2017. Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses). In EDBT. 72--83.

[16]

Vadim Engelson, Peter Fritzson, and Dag Fritzson. 2000. Lossless compression of high-volume numerical data from simulations.

[17]

Agner Fog et al. 2011. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Copenhagen University College of Engineering 93 (2011), 110. https://www.agner.org/optimize/instruction_tables.pdf

[18]

Nathaniel Fout and Kwan-Liu Ma. 2012. An adaptive prediction-based approach to lossless compression of floating-point volume data. IEEE Transactions on Visualization and Computer Graphics 18, 12 (2012), 2295--2304.

Digital Library

[19]

David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM computing surveys (CSUR) 23, 1 (1991), 5--48.

[20]

Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing relations and indexes. In Proceedings 14th International Conference on Data Engineering. IEEE, 370--379.

[21]

John L Gustafson and Isaac T Yonemoto. 2017. Beating floating point at its own game: Posit arithmetic. Supercomputing frontiers and innovations 4, 2 (2017), 71--86.

[22]

Martin Isenburg, Peter Lindstrom, and Jack Snoeyink. 2005. Lossless compression of predicted floating-point geometry. Computer-Aided Design 37, 8 (2005), 869--877.

Digital Library

[23]

Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. 2018. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. Proceedings of the VLDB Endowment 11, 13 (2018), 2209--2222.

Digital Library

[24]

DuckDB Labs. 2022. Patas Compression: Variation on Chimp. https://github.com/duckdb/duckdb/pull/5044. Accessed on: 2023-04--13.

[25]

Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In Proceedings of the 2016 International Conference on Management of Data. 311--326.

Digital Library

[26]

Seungyeon Lee, Jusuk Lee, Yongmin Kim, Kicheol Park, Jiman Hong, and Junyoung Heo. 2020. Efficient scheme for compressing and transferring data in hadoop clusters. In Proceedings of the 35th Annual ACM Symposium on Applied Computing. 1256--1263.

Digital Library

[27]

Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29.

Digital Library

[28]

Ruiyuan Li, Zheng Li, Yi Wu, Chao Chen, and Yu Zheng. 2023. Elf: Erasing-based Lossless Floating-Point Compression. Proceedings of the VLDB Endowment 16, 7 (2023).

Digital Library

[29]

Panagiotis Liakos, Katia Papakonstantinopoulou, and Yannis Kotidis. 2022. Chimp: efficient lossless floating point compression for time series databases. Proceedings of the VLDB Endowment 15, 11 (2022), 3058--3070.

Digital Library

[30]

Peter Lindstrom and Martin Isenburg. 2006. Fast and efficient compression of floating-point data. IEEE transactions on visualization and computer graphics 12, 5 (2006), 1245--1250.

Digital Library

[31]

Adnan Alhomssi Maximilian Kuschewski, David Sauerwein and Viktor Leis. 2023. BtrBlocks: Efficient Columnar Com-pression for Data Lakes. Proceedings of the 2023 ACM SIGMOD international conference on Management of data. https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf In press. Accessed on: 2023-04--13.

[32]

National Ecological Observatory Network (NEON). 2021. 2D wind speed and direction (DP1.00001.001). https: //doi.org/10.48443/S9YA-ZC81

[33]

National Ecological Observatory Network (NEON). 2021. Barometric pressure (DP1.00004.001). https://doi.org/10. 48443/RXR7-PP32

[34]

National Ecological Observatory Network (NEON). 2021. Dust and particulate size distribution (DP1.00017.001). https://doi.org/10.48443/4E6X-V373

[35]

National Ecological Observatory Network (NEON). 2021. IR biological temperature (DP1.00005.001). https://doi.org/ 10.48443/JNWY-B177

[36]

National Ecological Observatory Network (NEON). 2021. Relative humidity above water on-buoy (DP1.20271.001). https://doi.org/10.48443/Z99V-0502

[37]

Pedro Pedreira, Orri Erling, Masha Basmanova, Kevin Wilfong, Laith Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. 2022. Velox: meta's unified execution engine. Proceedings of the VLDB Endowment 15, 12 (2022), 3372--3384.

Digital Library

[38]

Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8, 12 (2015), 1816--1827.

Digital Library

[39]

Johannes Pietrzyk, Annett Ungethüm, Dirk Habich, and Wolfgang Lehner. 2018. Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes. In Grundlagen von Datenbanken. 71--76.

[40]

Mark Raasveldt and Hannes Muehleisen. 2019. DuckDB. https://github.com/duckdb/duckdb Accesed on: 2023-04--13.

[41]

Mark Raasveldt and Hannes Mühleisen. 2019. Duckdb: an embeddable analytical database. In Proceedings of the 2019 International Conference on Management of Data. 1981--1984.

Digital Library

[42]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).

[43]

Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang. 2023. CoEdIT: Text Editing by Task-Specific Instruction Tuning. (2023). arXiv:2305.09857 [cs.CL]

[44]

Vijayshankar Raman and Garret Swart. 2006. How to wring a table dry: Entropy compression of relations and querying of compressed relations. In Proceedings of the 32nd international conference on Very large data bases. Citeseer, 858--869.

[45]

Paruj Ratanaworabhan, Jian Ke, and Martin Burtscher. 2006. Fast lossless compression of scientific floating-point data. In Data Compression Conference (DCC'06). IEEE, 133--142.

Digital Library

[46]

Mark A Roth and Scott J Van Horn. 1993. Database compression. ACM Sigmod Record 22, 3 (1993), 31--39.

Digital Library

[47]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865 (2023).

[48]

Aliaksandr Valialkin. 2019. VictoriaMetrics: achieving better compression than Gorilla for time series data. https: //faun.pub/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932. Accesed on: 2023-04--13.

[49]

Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor Leis, Tobias Mühlbauer, Thomas Neumann, and Manuel Then. 2018. Get real: How benchmarks fail to represent the real world. In Proceedings of the Workshop on Testing Database Systems. 1--6.

Digital Library

[50]

Deepak Vohra. 2016. Apache Parquet. 325--335. https://doi.org/10.1007/978--1--4842--2199-0_8

[51]

Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In 22nd International Conference on Data Engineering (ICDE'06). IEEE, 59--59.

Digital Library

Cited By

Pedreira PMajeti DErling O(2024)Composable Data Management: An Execution OverviewProceedings of the VLDB Endowment10.14778/3685800.368584717:12(4249-4252)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685847
Sakka LPedreira PErling OBasmanova MWilfong KHe WMeng XPai KVig B(2024)Simple (yet Efficient) Function Authoring for Vectorized EnginesProceedings of the VLDB Endowment10.14778/3685800.368583617:12(4187-4199)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685836
Li XGuo CQian KZhang MYang MXu M(2024)Near-Lossless Gradient Compression for Data-Parallel Distributed DNN TrainingProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698541(977-994)Online publication date: 20-Nov-2024
https://dl.acm.org/doi/10.1145/3698038.3698541

Index Terms

ALP: Adaptive Lossless floating-Point Compression
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression

Recommendations

Fast Lossless Compression of Seismic Floating-Point Data
IFITA '09: Proceedings of the 2009 International Forum on Information Technology and Applications - Volume 01

Modern seismic exploration produces vast amount of data that may be transferred between computers as well as to and from storage devices. Compression algorithms are then desirable to make seismic data processing more efficient in terms of storage and ...
Lossless-by-Lossy Coding for Scalable Lossless Image Compression

This paper presents a method of scalable lossless image compression by means of lossy coding. A progressive decoding capability and a full decoding for the lossless rendition are equipped with the losslessly encoded bit stream. Embedded coding is ...
Lossless Compression Using Efficient Encoding of Bitmasks
ISVLSI '09: Proceedings of the 2009 IEEE Computer Society Annual Symposium on VLSI

Lossless compression is widely used to improve both memory requirement and communication bandwidth in embedded systems. Dictionary based compression techniques are very popular because of their good compression efficiency and fast decompression ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 1, Issue 4

PACMMOD

December 2023

1317 pages

EISSN:2836-6573

DOI:10.1145/3637468

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 December 2023

Published in PACMMOD Volume 1, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
2,898
Total Downloads

Downloads (Last 12 months)2,898
Downloads (Last 6 weeks)400

Reflects downloads up to 24 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Pedreira PMajeti DErling O(2024)Composable Data Management: An Execution OverviewProceedings of the VLDB Endowment10.14778/3685800.368584717:12(4249-4252)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685847
Sakka LPedreira PErling OBasmanova MWilfong KHe WMeng XPai KVig B(2024)Simple (yet Efficient) Function Authoring for Vectorized EnginesProceedings of the VLDB Endowment10.14778/3685800.368583617:12(4187-4199)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685836
Li XGuo CQian KZhang MYang MXu M(2024)Near-Lossless Gradient Compression for Data-Parallel Distributed DNN TrainingProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698541(977-994)Online publication date: 20-Nov-2024
https://dl.acm.org/doi/10.1145/3698038.3698541

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents