research-article

Persistent Summaries

Authors:

Ji-Rong WenAuthors Info & Claims

ACM Transactions on Database Systems (TODS), Volume 47, Issue 3

Article No.: 11, Pages 1 - 42

https://doi.org/10.1145/3531053

Published: 18 August 2022 Publication History

Abstract

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.

Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which uses only sublinear space by sacrificing slightly on accuracy.

All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.

References

[1]

Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the ACM Symposium on Principles of Database Systems. 10–20.

Digital Library

[2]

N. Alon, Y. Matias, and M. Szegedy. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1 (1999), 137–147.

[3]

Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. 2011. Streaming algorithms via precision sampling. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE, 363–372.

Digital Library

[4]

Anonymous. Details omitted due to double-blind reviewing.

[5]

Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems. 286–296.

Digital Library

[6]

Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 286–296.

Digital Library

[7]

Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. 2002. Models and issues in data stream systems. In Proceedings of the ACM Symposium on Principles of Database Systems.

Digital Library

[8]

B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. 1996. An asymptotically optimal multiversion B-tree. VLDB J. 5 (1996), 264–275.

[9]

R. Ben-Basat, G. Einziger, R. Friedman, and Y. Kassner. 2016. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’16). 1–9.

[10]

Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. 2009. Optimal sampling from sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems. 147–156.

Digital Library

[11]

G. S. Brodal, Spyros Sioutas, Konstantinos Tsakalidis, and Kostas Tsichlas. 2012. Fully persistent B-trees. Theoret. Comput. Sci. 841 (Nov. 2012), 10–26.

[12]

J. Lawrence Carter and Mark N. Wegman. 1977. Universal classes of hash functions. In Proceedings of the ACM Symposium on Theory of Computing. 106–112.

[13]

Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Springer, 693–703.

[14]

Surajit Chaudhuri, Gautam Das, and Vivek R. Narasayya. 2007. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst. 32, 2 (2007), 9.

Digital Library

[15]

Bernard Chazelle and Leonidas Guibas. 1986. Fractional cascading: I. A data structuring technique. In Algorithmica, Vol. 1.

[16]

Min Chen and Shigang Chen. 2015. Counter tree: A scalable counter architecture for per-flow traffic measurement. In Proceedings of the 23rd IEEE International Conference on Network Protocols (ICNP’15). IEEE, 111–122.

[17]

Baek-Young Choi, Jaesung Park, and Zhi-Li Zhang. 2002. Adaptive random sampling for load change detection. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems. ACM, 272–273.

Digital Library

[18]

Edith Cohen, Ofir Geri, and Rasmus Pagh. 2020. Composable sketches for functions of frequencies: Beyond the worst case. Retrieved from https://arxiv.org/abs/2004.04772.

[19]

Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55 (2005), 58–75.

[20]

Graham Cormode and S. Muthukrishnan. 2005. What’s hot and what’s not: Tracking most frequent items dynamically. ACM Trans. Database Syst. 30 (2005), 249–278.

[21]

Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 2002. Maintaining stream statistics over sliding windows. SIAM J. Comput. 31, 6 (2002), 1794–1813.

Digital Library

[22]

Xenofontas A. Dimitropoulos, Paul Hurley, and Andreas Kind. 2008. Probabilistic lossy counting: An efficient algorithm for finding heavy hitters. Comput. Commun. Rev. 38, 1 (2008), 5.

Digital Library

[23]

Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. 2002. Processing complex aggregate queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 61–72.

Digital Library

[24]

James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. 1986. Making data structures persistent. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, Juris Hartmanis (Ed.). ACM, 109–121.

Digital Library

[25]

Wenfei Fan, Floris Geerts, Yang Cao, Ting Deng, and Ping Lu. 2015. Querying big data by accessing small data. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, 173–184.

Digital Library

[26]

Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin Strauss. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the International Conference on Very Large Data Bases, Vol. 1. 79–88.

[27]

S. Guha and A. McGregor. 2009. Stream order and order statistics: Quantile estimation in random-order streams. Soc. Industr. Appl. Math. J. Sci. Comput. 38 (2009).

[28]

Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-based frequency estimation algorithms. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=r1lohoCqY7.

[29]

Regant Y. S. Hung, Lap-Kei Lee, and Hing-Fung Ting. 2010. Finding frequent items over sliding windows with constant update time. Inform. Process. Lett. 110, 7 (2010), 257–260.

Digital Library

[30]

Regant Y. S. Hung and Hing-Fung Ting. 2008. Finding heavy hitters over the sliding window of a weighted data stream. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (Lecture Notes in Computer Science), Vol. 4957. Springer, 699–710.

[31]

Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and Bolin Ding. 2016. Quickr: Lazily approximating complex AdHoc queries in BigData clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 631–646.

Digital Library

[32]

Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28 (2003), 51–55.

[33]

Tao Li, Shigang Chen, and Yibei Ling. 2012. Per-flow traffic measurement through randomized counter sharing. IEEE/ACM Trans. Netw. 20, 5 (2012), 1622–1634.

Digital Library

[34]

D. Lomet, Roger Barga, Mohamed F. Mokbel, German Shegalov, Rui Wang, and Yunyue Zhu. 2005. Immortal DB: Transaction time support for SQL server. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 939–941.

Digital Library

[35]

David Lomet and Betty Salzberg. 1989. Access methods for multiversion data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 315–324.

Digital Library

[36]

David B. Lomet and Feifei Li. 2009. Improving transaction-time dbms performance and functionality. In Proceedings of the IEEE International Conference on Data Engineering. 581–591.

Digital Library

[37]

Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, 346–357.

Digital Library

[38]

Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst. 31 (2006), 1095–1133.

[39]

J. Misra and D. Gries. 1982. Finding repeated elements. 2 (1982), 143–152.

[40]

S. Muthukrishnan. 2003. Data streams: Algorithms and applications. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’03). SIAM, 413.

[41]

Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In Proceedings of the IEEE International Conference on Data Mining Workshops. 170–177.

Digital Library

[42]

Joseph O’Rourke. 1981. An on-line algorithm for fitting straight lines between data ranges. Commun. ACM 24 (1981), 574–578.

Digital Library

[43]

Yanqing Peng, Jinwei Guo, Feifei Li, Weining Qian, and Aoying Zhou. 2018. Persistent bloom filter: Membership testing for the entire history. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1037–1052.

Digital Library

[44]

Christian Plattner, Andreas Wapf, and Gustavo Alonso. 2006. Searching in time. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 754–756.

Digital Library

[45]

Florin Rusu and Alin Dobra. 2007. Statistical analysis of sketch estimators. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 187–198.

Digital Library

[46]

Anish Das Sarma, Martin Theobald, and Jennifer Widom. 2010. LIVE: A lineage-supported versioned DBMS. In Scientific and Statistical Database Management. Springer, 416–433.

[47]

Ross Shaull, Liuba Shrira, and Hao Xu. 2008. Skippy: A new snapshot indexing method for time travel in the storage manager. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 637–648.

Digital Library

[48]

Liuba Shrira and Hao Xu. 2005. Snap: Efficient snapshots for back-in-time execution. In Proceedings of the IEEE International Conference on Data Engineering. 434–445.

Digital Library

[49]

Yufei Tao, Ke Yi, Cheng Sheng, Jian Pei, and Feifei Li. 2010. Logging every footstep: Quantile summaries for the entire history. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 639–650.

Digital Library

[50]

P. J. Varman and R. M. Verma. 1997. An efficient multiversion access structure. IEEE Trans. Knowl. Data Eng. 9 (1997), 391–409.

[51]

Hao Wu, Junhao Gan, and Rui Zhang. 2020. Learning based distributed tracking. Retrieved from https://arxiv.org/abs/2006.12943.

[52]

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the ACM Symposium on Operating Systems Principles. 423–438.

Digital Library

Cited By

Index Terms

Persistent Summaries
1. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Recommendations

Sketching distributed sliding-window data streams

While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely ...
Space-efficient tracking of persistent items in a massive data stream

Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur 'regularly' in the stream. In contrast with heavy hitters, persistent items do not necessarily ...
Space-efficient tracking of persistent items in a massive data stream
DEBS '11: Proceedings of the 5th ACM international conference on Distributed event-based system

Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur "regularly" in the stream. In contrast with heavy-hitters, persistent items do not necessarily ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems

ACM Transactions on Database Systems Volume 47, Issue 3

September 2022

173 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/3544001

Editor:
Christopher Jermaine
Rice University, USA

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 August 2022

Online AM: 23 May 2022

Accepted: 01 April 2022

Revised: 01 August 2021

Received: 01 August 2020

Published in TODS Volume 47, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Refereed

Funding Sources

Beijing Natural Science Foundation
National Natural Science Foundation of China
PCL
Beijing Outstanding Young Scientist Program
Alibaba Group through Alibaba Innovative Research Program, by CCF-Baidu Open Fund
HKRGC

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
522
Total Downloads

Downloads (Last 12 months)130
Downloads (Last 6 weeks)32

Reflects downloads up to 19 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Full Text

View this article in Full Text.

HTML Format

View this article in HTML Format.

Media

Figures

Other

Tables

View full text|Download PDF

View Issue’s Table of Contents