Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Together is Better: Heavy Hitters Quantile Estimation

Published: 30 May 2023 Publication History

Abstract

Stream monitoring is fundamental in many data stream applications, such as financial data trackers, security, anomaly detection, and load balancing. In that respect, quantiles are of particular interest, as they often capture the user's utility. For example, if a video connection has high tail (e.g., 99'th percentile) latency, the perceived quality will suffer, even if the average and median latencies are low.
In this work, we consider the problem of approximating the per-item quantiles. Elements in our stream are (ID, value) tuples, and we wish to track the quantiles for each ID. Existing quantile sketches are designed for a plain number stream (i.e., containing just a value). While one could allocate a separate sketch instance for each ID, this may require an infeasible amount of memory. Instead, we consider tracking the quantiles for the heavy hitters (most frequent items), which are often considered particularly important, without knowing them beforehand.
We first present a couple of simple and effective algorithms that serve as baselines, a sampling approach and a sketching approach. Then, we present SQUAD, an algorithm that combines sampling and sketching while improving the asymptotic space complexity. Intuitively, SQUAD uses a background sampling process to capture the behaviour of the quantiles of an item before it is allocated with a sketch, thereby allowing us to use fewer samples and sketches. The algorithms are rigorously analyzed, and we demonstrate SQUAD's superiority using extensive~simulations on real-world traces.

Supplemental Material

MP4 File
Presentation video for SIGMOD 2023

References

[1]
Handling elephant flow on a dpdk-based load balancer - https://dpdksummitapac2021.sched.com/event/hdlm/handling-elephant-flow-on-a-dpdk-based-load-balancer.
[2]
Open source code. https://anonymous.4open.science/r/squad-3BB9.
[3]
The Network Simulator ns-3. https://www.nsnam.org/research/wns3/wns3--2015/.
[4]
Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):1--28, 2013.
[5]
Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center tcp (dctcp). In Proceedings of the ACM SIGCOMM 2010 Conference, pages 63--74, 2010.
[6]
Daniel Anderson, Pryce Bevan, Kevin Lang, Edo Liberty, Lee Rhodes, and Justin Thaler. A high-performance algorithm for identifying frequent items in data streams. In Proceedings of the 2017 Internet Measurement Conference, pages 268--282. ACM, 2017.
[7]
Arvind Arasu and Gurmeet Singh Manku. Approximate counts and quantiles over sliding windows. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 286--296, 2004.
[8]
Ran Ben Basat, Gil Einziger, Shir Landau Feibish, Jalil Moraney, and Danny Raz. Network-wide routing-oblivious heavy hitters. In Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems, pages 66--73, 2018.
[9]
Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo Caggiani Luizelli, and Erez Waisbard. Volumetric hierarchical heavy hitters. In 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 381--392. IEEE, 2018.
[10]
Ran Ben Basat, Gil Einziger, Marcelo Caggiani Luizelli, and Erez Waisbard. A black-box method for accelerating measurement algorithms with accuracy guarantees. In 2019 IFIP Networking Conference (IFIP Networking), pages 1--9. IEEE, 2019.
[11]
Ran Ben Basat, Gil Einziger, and Roy Friedman. Space efficient elephant flow detection. In Proceedings of the 11th ACM International Systems and Storage Conference, pages 115--115, 2018.
[12]
Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Optimal elephant flow detection. In IEEE INFOCOM, 2017.
[13]
Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo C Luizelli, and Erez Waisbard. Constant time updates in hierarchical heavy hitters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 127--140, 2017.
[14]
Ran Ben Basat, Sivaramakrishnan Ramanathan, Yuliang Li, Gianni Antichi, Minian Yu, and Michael Mitzenmacher. Pint: Probabilistic in-band network telemetry. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pages 662--680, 2020.
[15]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's Globally Distributed Database. ACM Transactions on Computer Systems, 31(3), aug 2013.
[16]
Graham Cormode, Minos Garofalakis, Shanmugavelayutham Muthukrishnan, and Rajeev Rastogi. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 25--36, 2005.
[17]
Graham Cormode and Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. J. VLDB, 19(1):3--20, 2010.
[18]
Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Vesely. Relative error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 96--108, 2021.
[19]
Graham Cormode, Flip Korn, S Muthukrishnan, and Divesh Srivastava. Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 155--166, 2004.
[20]
Graham Cormode, Flip Korn, Shanmugavelayutham Muthukrishnan, and Divesh Srivastava. Space-and time-efficient deterministic algorithms for biased quantiles over data streams. In Proceedings of the twenty-fifth ACM SIGMOD- SIGACT-SIGART symposium on Principles of database systems, pages 263--272, 2006.
[21]
Graham Cormode and Pavel Vesel
[22]
y. A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 81--93, 2020.
[23]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In Proceedings of 21st ACM SIGOPS Symposium on Operating Systems Principles, SOSP, page 205--220, 2007.
[24]
Pavlos S Efraimidis. Weighted random sampling over data streams. In Algorithms, Probability, Networks, and Games, pages 183--195. Springer, 2015.
[25]
Cristian Estan, Ken Keys, David Moore, and George Varghese. Building a Better NetFlow. In ACM SIGCOMM, 2004.
[26]
David Felber and Rafail Ostrovsky. A randomized online quantile summary in > (1 Y log(1 Y)) words. Theory of Computing, 13(1):1--17, 2017.
[27]
Naga K Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 611--622, 2005.
[28]
Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 30(2):58--66, 2001.
[29]
Michael B Greenwald and Sanjeev Khanna. Power-conserving computation of order-statistics over sensor networks. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 275--285, 2004.
[30]
Anupam Gupta and Francis Zane. Counting inversions in lists. In SODA, volume 3, pages 253--254, 2003.
[31]
Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu. Sampling based algorithms for quantile computation in sensor networks. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 745--756, 2011.
[32]
Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, and Vladimir Braverman. Streaming quantiles algorithms with small space and update time. arXiv preprint arXiv:1907.00236, 2019.
[33]
Zohar Karnin, Kevin Lang, and Edo Liberty. Optimal quantile approximation in streams. In 2016 ieee 57th annual symposium on foundations of computer science (focs), pages 71--78. IEEE, 2016.
[34]
Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. A Simple Algorithm for Finding Frequent Elements in Streams and Bags. ACM Transactions Database Systems, 2003.
[35]
Jonatan Langlet, Ran Ben-Basat, Sivaramakrishnan Ramanathan, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, and Gianni Antichi. Zero-cpu collection with direct telemetry access. In Proceedings of the Twentieth ACM Workshop on Hot Topics in Networks, pages 108--115, 2021.
[36]
Kim-Hung Li. Reservoir-sampling algorithms of time complexity o (n (1 log (n/n))). ACM Transactions on Mathematical Software (TOMS), 20(4):481--493, 1994.
[37]
Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. Nitrosketch: Robust and general sketch-based monitoring in software switches. In Proceedings of the ACM Special Interest Group on Data Communication, pages 334--350. 2019.
[38]
Kristina Bennett Liz Fong-Jones and Stephen Thorne. Developing Effective Service Level Indicators and Service Level Objectives. In SRECon, 2018.
[39]
Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449--472, 2016.
[40]
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2):426--435, 1998.
[41]
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. ACM SIGMOD Record, 28(2):251--262, 1999.
[42]
Andrew McGregor, A Pavan, Srikanta Tirthapura, and David P Woodruff. Space-efficient estimation of statistics over sub-sampled streams. Algorithmica, 74(2):787--811, 2016.
[43]
Kathryn McKinley. Measuring and optimizing tail latency. In Strange Loop Conference, 2017. https://www.youtube. com/watch?v=_Zoa3xkzgFk.
[44]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398--412. Springer, 2005.
[45]
Jayadev Misra and David Gries. Finding repeated elements. Technical report, 1982.
[46]
J Ian Munro and Mike S Paterson. Selection and sorting with limited storage. Theoretical computer science, 12(3):315--323, 1980.
[47]
Srinivas Narayana, Anirudh Sivaraman, Vikram Nathan, Prateesh Goyal, Venkat Arun, Mohammad Alizadeh, Vimalkumar Jeyakumar, and Changhoon Kim. Language-directed hardware design for network performance monitoring. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 85--98, 2017.
[48]
Bruno Ribeiro, Tao Ye, and Don Towsley. A resource-minimalist flow size histogram estimator. In Proceedings of the 8th ACM SIGCOMM conference on Internet measurement, pages 285--290, 2008.
[49]
Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C Snoeren. Inside the social network's (datacenter) network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 123--137, 2015.
[50]
Eric Schurman and Jake Brutlag. Performance related changes and their user impact. In Velocity Web Performance and Operations Conference, 2009. https://www.youtube.com/watch?v=bQSE51-gr2s.
[51]
Robert J Serffing. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pages 39--48, 1974.
[52]
Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of the 2nd international conference on Embedded networked sensor systems, pages 239--249, 2004.
[53]
The P4.org Applications Working Group. In-band Network Telemetry (INT) Dataplane Specification. https://github.com/p4lang/p4-applications/blob/master/docs/telemetry_report.pdf.
[54]
Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11--30. Springer, 2015. First published in 1971.
[55]
Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985.
[56]
Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206--223, 2013.
[57]
Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 1023--1026, 2007.
[58]
Ying Zhang, Xuemin Lin, Jian Xu, Flip Korn, and Wei Wang. Space-efficient relative error order sketch over data streams. In 22nd International Conference on Data Engineering (ICDE'06), pages 51--51. IEEE, 2006.

Cited By

View all
  • (2024)SQUID: Faster Analytics via Sampled Quantile EstimationProceedings of the ACM on Networking10.1145/36768732:CoNEXT3(1-23)Online publication date: 21-Aug-2024
  • (2024)Online Detection of Outstanding Quantiles with QuantileFilter2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00069(831-844)Online publication date: 13-May-2024
  • (2023)SPADA: A Sparse Approximate Data Structure Representation for Data Plane Per-flow MonitoringProceedings of the ACM on Networking10.1145/36291491:CoNEXT3(1-25)Online publication date: 28-Nov-2023

Index Terms

  1. Together is Better: Heavy Hitters Quantile Estimation

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 1, Issue 1
    PACMMOD
    May 2023
    2807 pages
    EISSN:2836-6573
    DOI:10.1145/3603164
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2023
    Published in PACMMOD Volume 1, Issue 1

    Permissions

    Request permissions for this article.

    Author Tags

    1. data stream algorithms
    2. quantiles
    3. sketches

    Qualifiers

    • Research-article

    Funding Sources

    • ISF
    • Technion HPI research school

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)146
    • Downloads (Last 6 weeks)15
    Reflects downloads up to 14 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)SQUID: Faster Analytics via Sampled Quantile EstimationProceedings of the ACM on Networking10.1145/36768732:CoNEXT3(1-23)Online publication date: 21-Aug-2024
    • (2024)Online Detection of Outstanding Quantiles with QuantileFilter2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00069(831-844)Online publication date: 13-May-2024
    • (2023)SPADA: A Sparse Approximate Data Structure Representation for Data Plane Per-flow MonitoringProceedings of the ACM on Networking10.1145/36291491:CoNEXT3(1-25)Online publication date: 28-Nov-2023

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media