Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3219819.3219978acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

HeavyGuardian: Separate and Guard Hot Items in Data Streams

Published: 19 July 2018 Publication History

Abstract

Data stream processing is a fundamental issue in many fields, such as data mining, databases, network traffic measurement. There are five typical tasks in data stream processing: frequency estimation, heavy hitter detection, heavy change detection, frequency distribution estimation, and entropy estimation. Different algorithms are proposed for different tasks, but they seldom achieve high accuracy and high speed at the same time. To address this issue, we propose a novel data structure named HeavyGuardian. The key idea is to intelligently separate and guard the information of hot items while approximately record the frequencies of cold items. We deploy HeavyGuardian on the above five typical tasks. Extensive experimental results show that HeavyGuardian achieves both much higher accuracy and higher speed than the state-of-the-art solutions for each of the five typical tasks. The source codes of HeavyGuardian and other related algorithms are available at GitHub.

Supplementary Material

MP4 File (gong_items_in_data_streams.mp4)

References

[1]
The source codes of heavyguardian and other related algorithms. https://github.com/Gavindeed/HeavyGuardianvspace0.03in.
[2]
Shoba Venkataraman, Dawn Song, Phillip B Gibbons, and Avrim Blum. New streaming algorithms for fast detection of superspreaders. Department of Electrical and Computing Engineering, page 6, 2005.
[3]
Elisa Bertino. Introduction to data security and privacy. Data Science and Engineering, 1(3):125--126, 2016.
[4]
Minlan Yu, Lavanya Jose, and Rui Miao. Software defined traffic measurement with opensketch. In NSDI, volume 13, pages 29--42, 2013.
[5]
Ben Chen, Zhijin Lv, Xiaohui Yu, and Yang Liu. Sliding window top-k monitoring over distributed data streams. Data Science and Engineering, 2(4):289--300, 2017.
[6]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398--412. Springer, 2005.
[7]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ¶roc Springer ICDT, 2005.
[8]
Graham Cormode, Flip Korn, S Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 464--475, 2003.
[9]
Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo Caggiani Luizelli, and Erez Waisbard. Constant time updates in hierarchical heavy hitters. arXiv preprint arXiv:1707.06778, 2017.
[10]
Nan Tang, Qing Chen, and Prasenjit Mitra. Graph stream summarization: From big bang to big crunch. In Proceedings of the 2016 International Conference on Management of Data, pages 1481--1496. ACM, 2016.
[11]
Graham Cormode. Sketch techniques for approximate query processing. Foundations and Trends in Databases. NOW publishers, 2011.
[12]
Pratanu Roy, Arijit Khan, and Gustavo Alonso. Augmented sketch: Faster and more accurate stream processing. In ¶roc SIGMOD, 2016.
[13]
Tong Yang, Yang Zhou, Hao Jin, Shigang Chen, and Xiaoming Li. Pyramid sketch: a sketch framework for frequency estimation of data streams. Proceedings of the VLDB Endowment, 10(11):1442--1453, 2017.
[14]
Yuliang Li, Rui Miao, Changhoon Kim, and Minlan Yu. Flowradar: A better netflow for data centers. In NSDI, pages 311--324, 2016.
[15]
Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems (TOCS), 21(3):270--313, 2003.
[16]
Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qiu. Spatio-temporal compressive sensing and internet traffic matrices. In ACM SIGCOMM Computer Communication Review, volume 39, pages 267--278. ACM, 2009.
[17]
Theophilus Benson, Aditya Akella, and David A Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pages 267--280. ACM, 2010.
[18]
Graham Cormode, Balachander Krishnamurthy, and Walter Willinger. A manifesto for modeling and measurement in social media. First Monday, 15(9), 2010.
[19]
Dave Maltz. Unraveling the complexity of network management. 2009.
[20]
Ilker Nadi Bozkurt, Yilun Zhou, Theophilus Benson, Bilal Anwer, Dave Levin, Nick Feamster, Aditya Akella, Balakrishnan Chandrasekaran, Cheng Huang, Bruce Maggs, et al. Dynamic prioritization of traffic in home networks. 2015.
[21]
Jiecao Chen and Qin Zhang. Bias-aware sketches. Proceedings of the VLDB Endowment, 10(9):961--972, 2017.
[22]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Automata, languages and programming, pages 784--784, 2002.
[23]
Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005.
[24]
Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig. Cold filter: A meta-framework for faster and more accurate stream processing.
[25]
Katsiaryna Mirylenka, Graham Cormode, Themis Palpanas, and Divesh Srivastava. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal, 24(3):395--414, 2015.
[26]
Gobinda G Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
[27]
Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S Muthukrishnan, and Jennifer Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, pages 164--176. ACM, 2017.
[28]
Mohamed A Soliman, Ihab F Ilyas, and Kevin Chen-Chuan Chang. Top-k query processing in uncertain databases. In IEEE 23rd International Conference on Data Engineering, pages 896--905. IEEE, 2007.
[29]
Erik Demaine, Alejandro López-Ortiz, and J Munro. Frequency estimation of internet packet streams with limited space. Algorithms-ESA 2002, 2002.
[30]
Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In ¶roc VLDB 2002, pages 346--357.
[31]
Monika Rauch Henzinger. Algorithmic challenges in web search engines. Internet Mathematics, 1(1):115--123, 2004.
[32]
Er Krishnamurthy, Subhabrata Sen, and Yin Zhang. Sketchbased change detection: Methods, evaluation, and applications. In In ACM SIGCOMM Internet Measurement Conference. Citeseer, 2003.
[33]
Robert Schweller, Ashish Gupta, Elliot Parsons, and Yan Chen. Reversible sketches for efficient and accurate change detection over network data streams. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, 2004.
[34]
Chung Chen and Lon-Mu Liu. Forecasting time series with outliers. Journal of Forecasting, 12(1):13--35, 1993.
[35]
Viswanath Poosala and Yannis E Ioannidis. Estimation of query-result distribution and its application in parallel-join load balancing. In VLDB, pages 448--459, 1996.
[36]
Shanshan Ying, Flip Korn, Barna Saha, and Divesh Srivastava. Treescope: finding structural anomalies in semi-structured data. VLDB, 2015.
[37]
Abhishek Kumar, Minho Sung, Jun Jim Xu, and Jia Wang. Data streaming algorithms for efficient and accurate estimation of flow size distribution. In ¶roc ACM SIGMETRICS, pages 177--188, 2004.
[38]
Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449--472, 2016.
[39]
Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999.
[40]
Zhetao Li, Baoming Chang, Shiguo Wang, Anfeng Liu, Fanzi Zeng, and Guangming Luo. Dynamic compressive wide-band spectrum sensing based on channel energy reconstruction in cognitive internet of things. IEEE Transactions on Industrial Informatics, 2018.
[41]
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth finding on the deep web: Is the problem solved? In Proceedings of the VLDB Endowment, volume 6, pages 97--108, 2012.
[42]
Zhetao Li, Fu Xiao, Shiguo Wang, Tingrui Pei, and Jie Li. Achievable rate maximization for cognitive hybrid satellite-terrestrial networks with af-relays. IEEE Journal on Selected Areas in Communications, 36(2):304--313, 2018.
[43]
Ashwin Lall, Vyas Sekar, Mitsunori Ogihara, Jun Xu, and Hui Zhang. Data streaming algorithms for estimating entropy of network traffic. In ¶roc ACM SIGMETRICS, pages 145--156, 2006.
[44]
The caida anonymized internet traces 2016. http://www.caida.org/data/overview/vspace0.03in.
[45]
Frequent itemset mining dataset repository. http://fimi.ua.ac.be/data/.
[46]
Christian Henke, Carsten Schmoll, and Tanja Zseby. Empirical evaluation of hash functions for multipoint measurements. SIGCOMM CCR., 2008.

Cited By

View all
  • (2024)CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation ModelsProceedings of the ACM on Management of Data10.1145/36393062:1(1-28)Online publication date: 26-Mar-2024
  • (2024)Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded MemoryProceedings of the ACM on Management of Data10.1145/36392852:1(1-27)Online publication date: 26-Mar-2024
  • (2024)Stable-Sketch: A Versatile Sketch for Accurate, Fast, Web-Scale Data Stream ProcessingProceedings of the ACM Web Conference 202410.1145/3589334.3645581(4227-4238)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data stream processing
  2. data sturcture
  3. probabilistic and approximate data

Qualifiers

  • Research-article

Conference

KDD '18
Sponsor:

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)79
  • Downloads (Last 6 weeks)12
Reflects downloads up to 09 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation ModelsProceedings of the ACM on Management of Data10.1145/36393062:1(1-28)Online publication date: 26-Mar-2024
  • (2024)Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded MemoryProceedings of the ACM on Management of Data10.1145/36392852:1(1-27)Online publication date: 26-Mar-2024
  • (2024)Stable-Sketch: A Versatile Sketch for Accurate, Fast, Web-Scale Data Stream ProcessingProceedings of the ACM Web Conference 202410.1145/3589334.3645581(4227-4238)Online publication date: 13-May-2024
  • (2024)A Generic Framework for Finding Special Quadratic Elements in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.339202932:4(3269-3284)Online publication date: Aug-2024
  • (2024)A Probabilistic Sketch for Summarizing Cold Items of Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2023.331642632:2(1287-1302)Online publication date: Apr-2024
  • (2024)P-Sketch: A Fast and Accurate Sketch for Persistent Item LookupIEEE/ACM Transactions on Networking10.1109/TNET.2023.330689732:2(987-1002)Online publication date: Apr-2024
  • (2024)Multithreading Heterogeneous Graph AggregationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.3320127(1-15)Online publication date: 2024
  • (2024)Hierarchical Sketch: An Efficient, Scalable and Latency-aware Content Caching Design for Content Delivery Networks2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS)10.1109/IWQoS61813.2024.10682952(1-6)Online publication date: 19-Jun-2024
  • (2024)Advancing Sketch-Based Network Measurement: A General, Fine-Grained, Bit-Adaptive Sliding Window Framework2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS)10.1109/IWQoS61813.2024.10682923(1-10)Online publication date: 19-Jun-2024
  • (2024)NDPBridge: Enabling Cross-Bank Coordination in Near-DRAM-Bank Processing Architectures2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00052(628-643)Online publication date: 29-Jun-2024
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media