Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Persistent Summaries

Published: 18 August 2022 Publication History

Abstract

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.
Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which uses only sublinear space by sacrificing slightly on accuracy.
All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.

References

[1]
Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the ACM Symposium on Principles of Database Systems. 10–20.
[2]
N. Alon, Y. Matias, and M. Szegedy. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1 (1999), 137–147.
[3]
Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. 2011. Streaming algorithms via precision sampling. In Proceedings of the IEEE Symposium on Foundations of Computer Science. IEEE, 363–372.
[4]
Anonymous. Details omitted due to double-blind reviewing.
[5]
Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems. 286–296.
[6]
Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 286–296.
[7]
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. 2002. Models and issues in data stream systems. In Proceedings of the ACM Symposium on Principles of Database Systems.
[8]
B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. 1996. An asymptotically optimal multiversion B-tree. VLDB J. 5 (1996), 264–275.
[9]
R. Ben-Basat, G. Einziger, R. Friedman, and Y. Kassner. 2016. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM’16). 1–9.
[10]
Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. 2009. Optimal sampling from sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems. 147–156.
[11]
G. S. Brodal, Spyros Sioutas, Konstantinos Tsakalidis, and Kostas Tsichlas. 2012. Fully persistent B-trees. Theoret. Comput. Sci. 841 (Nov. 2012), 10–26.
[12]
J. Lawrence Carter and Mark N. Wegman. 1977. Universal classes of hash functions. In Proceedings of the ACM Symposium on Theory of Computing. 106–112.
[13]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
[14]
Surajit Chaudhuri, Gautam Das, and Vivek R. Narasayya. 2007. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst. 32, 2 (2007), 9.
[15]
Bernard Chazelle and Leonidas Guibas. 1986. Fractional cascading: I. A data structuring technique. In Algorithmica, Vol. 1.
[16]
Min Chen and Shigang Chen. 2015. Counter tree: A scalable counter architecture for per-flow traffic measurement. In Proceedings of the 23rd IEEE International Conference on Network Protocols (ICNP’15). IEEE, 111–122.
[17]
Baek-Young Choi, Jaesung Park, and Zhi-Li Zhang. 2002. Adaptive random sampling for load change detection. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems. ACM, 272–273.
[18]
Edith Cohen, Ofir Geri, and Rasmus Pagh. 2020. Composable sketches for functions of frequencies: Beyond the worst case. Retrieved from https://arxiv.org/abs/2004.04772.
[19]
Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55 (2005), 58–75.
[20]
Graham Cormode and S. Muthukrishnan. 2005. What’s hot and what’s not: Tracking most frequent items dynamically. ACM Trans. Database Syst. 30 (2005), 249–278.
[21]
Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 2002. Maintaining stream statistics over sliding windows. SIAM J. Comput. 31, 6 (2002), 1794–1813.
[22]
Xenofontas A. Dimitropoulos, Paul Hurley, and Andreas Kind. 2008. Probabilistic lossy counting: An efficient algorithm for finding heavy hitters. Comput. Commun. Rev. 38, 1 (2008), 5.
[23]
Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. 2002. Processing complex aggregate queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 61–72.
[24]
James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. 1986. Making data structures persistent. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, Juris Hartmanis (Ed.). ACM, 109–121.
[25]
Wenfei Fan, Floris Geerts, Yang Cao, Ting Deng, and Ping Lu. 2015. Querying big data by accessing small data. In Proceedings of the ACM Symposium on Principles of Database Systems. ACM, 173–184.
[26]
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin Strauss. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the International Conference on Very Large Data Bases, Vol. 1. 79–88.
[27]
S. Guha and A. McGregor. 2009. Stream order and order statistics: Quantile estimation in random-order streams. Soc. Industr. Appl. Math. J. Sci. Comput. 38 (2009).
[28]
Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. 2019. Learning-based frequency estimation algorithms. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=r1lohoCqY7.
[29]
Regant Y. S. Hung, Lap-Kei Lee, and Hing-Fung Ting. 2010. Finding frequent items over sliding windows with constant update time. Inform. Process. Lett. 110, 7 (2010), 257–260.
[30]
Regant Y. S. Hung and Hing-Fung Ting. 2008. Finding heavy hitters over the sliding window of a weighted data stream. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (Lecture Notes in Computer Science), Vol. 4957. Springer, 699–710.
[31]
Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and Bolin Ding. 2016. Quickr: Lazily approximating complex AdHoc queries in BigData clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 631–646.
[32]
Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28 (2003), 51–55.
[33]
Tao Li, Shigang Chen, and Yibei Ling. 2012. Per-flow traffic measurement through randomized counter sharing. IEEE/ACM Trans. Netw. 20, 5 (2012), 1622–1634.
[34]
D. Lomet, Roger Barga, Mohamed F. Mokbel, German Shegalov, Rui Wang, and Yunyue Zhu. 2005. Immortal DB: Transaction time support for SQL server. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 939–941.
[35]
David Lomet and Betty Salzberg. 1989. Access methods for multiversion data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 315–324.
[36]
David B. Lomet and Feifei Li. 2009. Improving transaction-time dbms performance and functionality. In Proceedings of the IEEE International Conference on Data Engineering. 581–591.
[37]
Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, 346–357.
[38]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst. 31 (2006), 1095–1133.
[39]
J. Misra and D. Gries. 1982. Finding repeated elements. 2 (1982), 143–152.
[40]
S. Muthukrishnan. 2003. Data streams: Algorithms and applications. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’03). SIAM, 413.
[41]
Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In Proceedings of the IEEE International Conference on Data Mining Workshops. 170–177.
[42]
Joseph O’Rourke. 1981. An on-line algorithm for fitting straight lines between data ranges. Commun. ACM 24 (1981), 574–578.
[43]
Yanqing Peng, Jinwei Guo, Feifei Li, Weining Qian, and Aoying Zhou. 2018. Persistent bloom filter: Membership testing for the entire history. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1037–1052.
[44]
Christian Plattner, Andreas Wapf, and Gustavo Alonso. 2006. Searching in time. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 754–756.
[45]
Florin Rusu and Alin Dobra. 2007. Statistical analysis of sketch estimators. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 187–198.
[46]
Anish Das Sarma, Martin Theobald, and Jennifer Widom. 2010. LIVE: A lineage-supported versioned DBMS. In Scientific and Statistical Database Management. Springer, 416–433.
[47]
Ross Shaull, Liuba Shrira, and Hao Xu. 2008. Skippy: A new snapshot indexing method for time travel in the storage manager. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 637–648.
[48]
Liuba Shrira and Hao Xu. 2005. Snap: Efficient snapshots for back-in-time execution. In Proceedings of the IEEE International Conference on Data Engineering. 434–445.
[49]
Yufei Tao, Ke Yi, Cheng Sheng, Jian Pei, and Feifei Li. 2010. Logging every footstep: Quantile summaries for the entire history. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 639–650.
[50]
P. J. Varman and R. M. Verma. 1997. An efficient multiversion access structure. IEEE Trans. Knowl. Data Eng. 9 (1997), 391–409.
[51]
Hao Wu, Junhao Gan, and Rui Zhang. 2020. Learning based distributed tracking. Retrieved from https://arxiv.org/abs/2006.12943.
[52]
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the ACM Symposium on Operating Systems Principles. 423–438.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 47, Issue 3
September 2022
173 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3544001
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 August 2022
Online AM: 23 May 2022
Accepted: 01 April 2022
Revised: 01 August 2021
Received: 01 August 2020
Published in TODS Volume 47, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Summaries
  2. streaming algorithms
  3. sketches

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Beijing Natural Science Foundation
  • National Natural Science Foundation of China
  • PCL
  • Beijing Outstanding Young Scientist Program
  • Alibaba Group through Alibaba Innovative Research Program, by CCF-Baidu Open Fund
  • HKRGC

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 522
    Total Downloads
  • Downloads (Last 12 months)130
  • Downloads (Last 6 weeks)32
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media