Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

ALP: Adaptive Lossless floating-Point Compression

Published: 12 December 2023 Publication History

Abstract

IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are not better than those of general-purpose compressors such as Zstd; while [de]compression is much slower than Delta and FOR.
We propose and evaluate ALP, that significantly improves these previous schemes in both speed and compression ratio (Figure 1). We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key for also improving the compression ratio, as we found in-vector commonalities to create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals [31] to losslessly encode doubles as integers if they originated as decimals, and otherwise uses vectorized compression of the doubles' front bits. Its high speeds stem from our implementation in scalar code that auto-vectorizes, using building blocks provided by our FastLanes library [6], and an efficient two-stage compression algorithm that first samples row-groups and then vectors.

References

[1]
2019. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754--2019 (Revision of IEEE 754--2008) (2019), 1--84. https://doi.org/10.1109/IEEESTD.2019.8766229
[2]
2019. Public BI Benchmark. https://github.com/cwida/public_bi_benchmark. Accessed on: 2023-04--13.
[3]
2023. FastLanes. https://github.com/cwida/FastLanes Accesed on: 2023-04--13.
[4]
Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. 671--682.
[5]
Azim Afroozeh and P Boncz. 2020. Towards a New File Format for Big Data: SIMD-Friendly Composable Compression.
[6]
Azim Afroozeh and Peter Boncz. 2023. The FastLanes Compression Layout: Decoding > 100 Billion Integers per Second with Scalar Code. Proc. VLDB Endow. 16, 9 (jul 2023), 2132--2144. https://doi.org/10.14778/3598581.3598587
[7]
Peter A Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In Cidr, Vol. 5. 225--237.
[8]
Boudewijn Braams. 2018. Predicate Pushdown in Parquet and Apache Spark. MSc thesis (2018).
[9]
Andrea Bruno, Franco Maria Nardini, Giulio Ermanno Pibiri, Roberto Trani, and Rossano Venturini. 2021. TSXor: A Simple Time Series Compression Algorithm. In String Processing and Information Retrieval: 28th International Symposium, SPIRE 2021, Lille, France, October 4--6, 2021, Proceedings 28. Springer, 217--223.
[10]
Martin Burtscher and Paruj Ratanaworabhan. 2008. FPC: A high-speed compressor for double-precision floating-point data. IEEE transactions on computers 58, 1 (2008), 18--31.
[11]
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging Properties in Self-Supervised Vision Transformers. CoRR abs/2104.14294 (2021). arXiv:2104.14294 https://arxiv.org/abs/2104.14294
[12]
Biswapesh Chattopadhyay, Priyam Dutta, Weiran Liu, Ott Tinn, Andrew Mccormick, Aniket Mokashi, Paul Harvey, Hector Gonzalez, David Lomax, Sagar Mittal, et al. 2019. Procella: Unifying serving and analytical data at YouTube.(2019).
[13]
Yann Collet. 2014. LZ4 - Extremely fast compression. https://github.com/lz4/lz4 Accesed on: 2023-04--13.
[14]
Yann Collet. 2015. Zstandard - Fast real-time compression algorithm. https://github.com/facebook/zstd Accesed on: 2023-04--13.
[15]
Patrick Damme, Dirk Habich, Juliana Hildebrandt, and Wolfgang Lehner. 2017. Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses). In EDBT. 72--83.
[16]
Vadim Engelson, Peter Fritzson, and Dag Fritzson. 2000. Lossless compression of high-volume numerical data from simulations.
[17]
Agner Fog et al. 2011. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Copenhagen University College of Engineering 93 (2011), 110. https://www.agner.org/optimize/instruction_tables.pdf
[18]
Nathaniel Fout and Kwan-Liu Ma. 2012. An adaptive prediction-based approach to lossless compression of floating-point volume data. IEEE Transactions on Visualization and Computer Graphics 18, 12 (2012), 2295--2304.
[19]
David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM computing surveys (CSUR) 23, 1 (1991), 5--48.
[20]
Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing relations and indexes. In Proceedings 14th International Conference on Data Engineering. IEEE, 370--379.
[21]
John L Gustafson and Isaac T Yonemoto. 2017. Beating floating point at its own game: Posit arithmetic. Supercomputing frontiers and innovations 4, 2 (2017), 71--86.
[22]
Martin Isenburg, Peter Lindstrom, and Jack Snoeyink. 2005. Lossless compression of predicted floating-point geometry. Computer-Aided Design 37, 8 (2005), 869--877.
[23]
Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. 2018. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. Proceedings of the VLDB Endowment 11, 13 (2018), 2209--2222.
[24]
DuckDB Labs. 2022. Patas Compression: Variation on Chimp. https://github.com/duckdb/duckdb/pull/5044. Accessed on: 2023-04--13.
[25]
Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In Proceedings of the 2016 International Conference on Management of Data. 311--326.
[26]
Seungyeon Lee, Jusuk Lee, Yongmin Kim, Kicheol Park, Jiman Hong, and Junyoung Heo. 2020. Efficient scheme for compressing and transferring data in hadoop clusters. In Proceedings of the 35th Annual ACM Symposium on Applied Computing. 1256--1263.
[27]
Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1 (2015), 1--29.
[28]
Ruiyuan Li, Zheng Li, Yi Wu, Chao Chen, and Yu Zheng. 2023. Elf: Erasing-based Lossless Floating-Point Compression. Proceedings of the VLDB Endowment 16, 7 (2023).
[29]
Panagiotis Liakos, Katia Papakonstantinopoulou, and Yannis Kotidis. 2022. Chimp: efficient lossless floating point compression for time series databases. Proceedings of the VLDB Endowment 15, 11 (2022), 3058--3070.
[30]
Peter Lindstrom and Martin Isenburg. 2006. Fast and efficient compression of floating-point data. IEEE transactions on visualization and computer graphics 12, 5 (2006), 1245--1250.
[31]
Adnan Alhomssi Maximilian Kuschewski, David Sauerwein and Viktor Leis. 2023. BtrBlocks: Efficient Columnar Com-pression for Data Lakes. Proceedings of the 2023 ACM SIGMOD international conference on Management of data. https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf In press. Accessed on: 2023-04--13.
[32]
National Ecological Observatory Network (NEON). 2021. 2D wind speed and direction (DP1.00001.001). https: //doi.org/10.48443/S9YA-ZC81
[33]
National Ecological Observatory Network (NEON). 2021. Barometric pressure (DP1.00004.001). https://doi.org/10. 48443/RXR7-PP32
[34]
National Ecological Observatory Network (NEON). 2021. Dust and particulate size distribution (DP1.00017.001). https://doi.org/10.48443/4E6X-V373
[35]
National Ecological Observatory Network (NEON). 2021. IR biological temperature (DP1.00005.001). https://doi.org/ 10.48443/JNWY-B177
[36]
National Ecological Observatory Network (NEON). 2021. Relative humidity above water on-buoy (DP1.20271.001). https://doi.org/10.48443/Z99V-0502
[37]
Pedro Pedreira, Orri Erling, Masha Basmanova, Kevin Wilfong, Laith Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. 2022. Velox: meta's unified execution engine. Proceedings of the VLDB Endowment 15, 12 (2022), 3372--3384.
[38]
Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. 2015. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment 8, 12 (2015), 1816--1827.
[39]
Johannes Pietrzyk, Annett Ungethüm, Dirk Habich, and Wolfgang Lehner. 2018. Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes. In Grundlagen von Datenbanken. 71--76.
[40]
Mark Raasveldt and Hannes Muehleisen. 2019. DuckDB. https://github.com/duckdb/duckdb Accesed on: 2023-04--13.
[41]
Mark Raasveldt and Hannes Mühleisen. 2019. Duckdb: an embeddable analytical database. In Proceedings of the 2019 International Conference on Management of Data. 1981--1984.
[42]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[43]
Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang. 2023. CoEdIT: Text Editing by Task-Specific Instruction Tuning. (2023). arXiv:2305.09857 [cs.CL]
[44]
Vijayshankar Raman and Garret Swart. 2006. How to wring a table dry: Entropy compression of relations and querying of compressed relations. In Proceedings of the 32nd international conference on Very large data bases. Citeseer, 858--869.
[45]
Paruj Ratanaworabhan, Jian Ke, and Martin Burtscher. 2006. Fast lossless compression of scientific floating-point data. In Data Compression Conference (DCC'06). IEEE, 133--142.
[46]
Mark A Roth and Scott J Van Horn. 1993. Database compression. ACM Sigmod Record 22, 3 (1993), 31--39.
[47]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865 (2023).
[48]
Aliaksandr Valialkin. 2019. VictoriaMetrics: achieving better compression than Gorilla for time series data. https: //faun.pub/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932. Accesed on: 2023-04--13.
[49]
Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor Leis, Tobias Mühlbauer, Thomas Neumann, and Manuel Then. 2018. Get real: How benchmarks fail to represent the real world. In Proceedings of the Workshop on Testing Database Systems. 1--6.
[50]
Deepak Vohra. 2016. Apache Parquet. 325--335. https://doi.org/10.1007/978--1--4842--2199-0_8
[51]
Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. 2006. Super-scalar RAM-CPU cache compression. In 22nd International Conference on Data Engineering (ICDE'06). IEEE, 59--59.

Cited By

View all
  • (2024)Composable Data Management: An Execution OverviewProceedings of the VLDB Endowment10.14778/3685800.368584717:12(4249-4252)Online publication date: 1-Aug-2024
  • (2024)Simple (yet Efficient) Function Authoring for Vectorized EnginesProceedings of the VLDB Endowment10.14778/3685800.368583617:12(4187-4199)Online publication date: 1-Aug-2024
  • (2024)Near-Lossless Gradient Compression for Data-Parallel Distributed DNN TrainingProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698541(977-994)Online publication date: 20-Nov-2024

Index Terms

  1. ALP: Adaptive Lossless floating-Point Compression

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 1, Issue 4
    PACMMOD
    December 2023
    1317 pages
    EISSN:2836-6573
    DOI:10.1145/3637468
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 December 2023
    Published in PACMMOD Volume 1, Issue 4

    Permissions

    Request permissions for this article.

    Author Tags

    1. big data formats
    2. columnar storage
    3. floating point compression
    4. lightweight compression
    5. lossless compression
    6. vectorized execution

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)2,898
    • Downloads (Last 6 weeks)400
    Reflects downloads up to 24 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Composable Data Management: An Execution OverviewProceedings of the VLDB Endowment10.14778/3685800.368584717:12(4249-4252)Online publication date: 1-Aug-2024
    • (2024)Simple (yet Efficient) Function Authoring for Vectorized EnginesProceedings of the VLDB Endowment10.14778/3685800.368583617:12(4187-4199)Online publication date: 1-Aug-2024
    • (2024)Near-Lossless Gradient Compression for Data-Parallel Distributed DNN TrainingProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698541(977-994)Online publication date: 20-Nov-2024

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media