DOI: 10.1109/GLOBECOM38437.2019.9013957
research-article

Lossless Compression of Time Series Data with Generalized Deduplication

Published: 01 December 2019

Abstract

To provide compressed storage for large amounts of time series data, we present a new strategy for data deduplication. Rather than attempting to deduplicate entire data chunks, we employ a generalized approach in which each chunk is split into a part worth deduplicating and a part that must be stored directly. This simple principle compresses the often similar but non-identical chunks of time series data more effectively than classic deduplication, while keeping benefits such as scalability, robustness, and on-the-fly storage, retrieval, and search for chunks. We analyze the method's theoretical performance and argue that it can asymptotically approach the entropy limit for some data configurations. To validate the method's practical merits, we show that it is competitive with popular universal compression algorithms on the MIT-BIH ECG Compression Test Database.
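The splitting principle described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a chunk is divided, so the rule used here is an assumption made purely for illustration: each chunk is a non-negative integer sample, its high-order bits form the part worth deduplicating (called the base here), and its low-order bits form the part stored directly (called the deviation here). The names `split_chunk`, `compress`, `decompress`, and `DEVIATION_BITS` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

# Assumption for illustration only: the low-order bits of each sample are the
# part stored directly; the remaining high-order bits are deduplicated.
DEVIATION_BITS = 4


def split_chunk(sample: int, deviation_bits: int = DEVIATION_BITS) -> Tuple[int, int]:
    """Split a non-negative integer chunk into (base, deviation)."""
    base = sample >> deviation_bits
    deviation = sample & ((1 << deviation_bits) - 1)
    return base, deviation


def compress(
    samples: List[int], deviation_bits: int = DEVIATION_BITS
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Store each distinct base once; keep (base index, deviation) per chunk."""
    base_ids: Dict[int, int] = {}  # base value -> position in `bases`
    bases: List[int] = []  # deduplicated bases, in order of first appearance
    records: List[Tuple[int, int]] = []
    for sample in samples:
        base, deviation = split_chunk(sample, deviation_bits)
        if base not in base_ids:
            base_ids[base] = len(bases)
            bases.append(base)
        records.append((base_ids[base], deviation))
    return bases, records


def decompress(
    bases: List[int],
    records: List[Tuple[int, int]],
    deviation_bits: int = DEVIATION_BITS,
) -> List[int]:
    """Rebuild every chunk exactly from its base and deviation (lossless)."""
    return [(bases[i] << deviation_bits) | dev for i, dev in records]


# Similar but non-identical samples, as is typical for a slowly varying signal.
samples = [1000, 1003, 1001, 1010, 1002, 998]
bases, records = compress(samples)
restored = decompress(bases, records)
```

In this example all six samples are distinct, so deduplicating whole chunks would find no repeats, whereas the split leaves only two distinct bases to store. Each record can also be decoded on its own from its base index and deviation, which is consistent with the on-the-fly retrieval the abstract mentions.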

Published In

2019 IEEE Global Communications Conference (GLOBECOM)
6544 pages

Publisher

IEEE Press
