research-article

In the land of data streams where synopses are missing, one framework to bring them all

Published: 01 June 2021

Abstract

In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first-class citizens, i.e., as pipeline operators; implementing synopses is left to users. This is mainly because the diversity of synopses makes a unified implementation difficult. We present Condor, a framework that supports synopses as first-class citizens. Condor facilitates the specification and processing of synopsis-based streaming jobs while hiding all internal processing details. Condor's key component is its model, which represents synopses as a particular case of windowed aggregate functions. An inherent divide-and-conquer strategy allows Condor to distribute the computation efficiently, yielding high performance and linear scalability. Our evaluation shows that Condor outperforms existing approaches by up to a factor of 75 and that it scales linearly with the number of cores.
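To illustrate the idea of treating a synopsis as a windowed aggregate function that can be computed by divide and conquer, the following minimal Python sketch builds a Count-Min sketch per partition of a window and merges the partial results. It is not Condor's API: the names `CountMinSketch`, `build`, and `windowed_sketch`, the default `width` and `depth`, and the round-robin partitioning are all assumptions made for this example.

```python
import hashlib
from functools import reduce


class CountMinSketch:
    """Illustrative Count-Min sketch; width and depth are fixed so sketches can be merged."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Deterministic per-row hash, so sketches built on different workers agree.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, count=1):
        # Incremental step of the aggregate: fold one stream element into the synopsis.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def merge(self, other):
        # Combine step: cell-wise addition of two partial synopses of equal shape.
        assert (self.width, self.depth) == (other.width, other.depth)
        merged = CountMinSketch(self.width, self.depth)
        for row in range(self.depth):
            merged.table[row] = [a + b for a, b in zip(self.table[row], other.table[row])]
        return merged

    def estimate(self, item):
        # Query step: the minimum over rows never underestimates the true count.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))


def build(partition):
    """Build one partial synopsis over a partition of a window (hypothetical helper)."""
    sketch = CountMinSketch()
    for item in partition:
        sketch.update(item)
    return sketch


def windowed_sketch(window, parallelism):
    """Divide: split the window round-robin. Conquer: merge the partial sketches."""
    partitions = [window[i::parallelism] for i in range(parallelism)]
    partials = [build(p) for p in partitions]
    return reduce(lambda a, b: a.merge(b), partials)
```

Because the merge is a cell-wise sum, `windowed_sketch(window, parallelism)` produces the same table as `build(window)` for any `parallelism >= 1`. That property is what lets the per-partition work run independently before a final combine step.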


Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 10
June 2021
219 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
