Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3580305.3599433acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Free access

MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches

Published: 04 August 2023 Publication History

Abstract

We abstract a MIMO scenario in distributed data stream mining, where a stream of multiple items is mined by multiple nodes. We design a framework named MimoSketch for the MIMO-specific scenario, which improves the fundamental mining task of item frequency estimation. MimoSketch consists of an algorithm design and a policy to schedule items to nodes. MimoSketch's algorithm applies random counting to preserve a mathematically proven unbiasedness property, which makes it friendly to the aggregate query on multiple nodes; its memory layout is dynamically adaptive to the runtime item size distribution, which maximizes the estimation accuracy by storing more items. MimoSketch's scheduling policy balances items among nodes, avoiding nodes being overloaded or underloaded, which improves the overall mining accuracy. Our prototype and evaluation show that our algorithm can improve the item frequency estimation accuracy by an order of magnitude compared with the state-of-the-art solutions, and the scheduling policy further promotes the performance in MIMO scenarios.

Supplementary Material

MP4 File (rtfp1460-2min-promo.mp4)
Promotional video of "MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches" We first explain "MIMO" and its four application scenarios: distributed data analytics, distributed machine learning, network telemetry, and Internet of Thins (IoT). Targeting data stream mining of item frequency in MIMO scenarios, we propose a framework named "MimoSketch", which mainly consists of two parts: a sketch algorithm and a scheduling policy. MimoSketch achieves unbiasedness, accuracy, and load balance, with both theory guarantee and experiment progress.

References

[1]
aappleby. 2015. MurmurHash. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
[2]
Anonymous Author(s). 2022. Source codes of MimoSketch and other baselines. https://github.com/MimoSketch/MimoSketch
[3]
Mohammad Hossein Bateni, Hossein Esfandiari, and Vahab Mirrokni. 2018. Optimal distributed submodular optimization via sketching. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1138--1147.
[4]
Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. 2001. Towards Sensor Database Systems. In Proceedings of the Second International Conference on Mobile Data Management. 3--14.
[5]
CAIDA. 2016. Anonymized Internet Traces. http://www.caida.org/data/overview
[6]
Francesco Cauteruccio, Luca Cinelli, Enrico Corradini, Giorgio Terracina, Domenico Ursino, Luca Virgili, Claudio Savaglio, Antonio Liotta, and Giancarlo Fortino. 2021. A framework for anomaly detection and classification in Multiple IoT scenarios. Future Generation Computer Systems, Vol. 114 (2021), 322--335.
[7]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding Frequent Items in Data Streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming. 693--703.
[8]
Lucchese Claudio, Orlando Salvatore, Perego Raffaele, and Silvestri Fabrizio. 2003. WebDocs: a real-life huge transactional dataset. http://fimi.uantwerpen.be/data/webdocs.pdf
[9]
Graham Cormode and S Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, Vol. 55, 1 (2005), 58--75.
[10]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, Vol. 51, 1 (2008), 107--113.
[11]
Cristian Estan and George Varghese. 2003. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems (TOCS), Vol. 21, 3 (2003), 270--313.
[12]
Bin Fan, Dave G Andersen, Michael Kaminsky, and Michael D Mitzenmacher. 2014. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. 75--88.
[13]
FIMI. 2003. Frequent Itemset Mining Dataset Repository. http://fimi.uantwerpen.be/data/
[14]
Xiangyang Gou, Long He, Yinda Zhang, Ke Wang, Xilai Liu, Tong Yang, Yi Wang, and Bin Cui. 2020. Sliding sketches: A framework using time zones for data stream processing in sliding windows. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1015--1025.
[15]
Jiqing Gu, Chao Song, Haipeng Dai, Lei Shi, Jinqiu Wu, and Li Lu. 2022. ACM: Accuracy-Aware Collaborative Monitoring for Software-Defined Network-Wide Measurement. Sensors, Vol. 22, 20 (2022), 7932.
[16]
Şule Gündüz and M Tamer Özsu. 2003. A web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 535--540.
[17]
Pao-Lu Hsu and Herbert Robbins. 1947. Complete convergence and the law of large numbers. Proceedings of the national academy of sciences, Vol. 33, 2 (1947), 25--31.
[18]
Qun Huang and Patrick PC Lee. 2014. Ld-sketch: A distributed sketching design for accurate and scalable anomaly detection in network data streams. In IEEE INFOCOM 2014-IEEE Conference on Computer Communications. IEEE, 1420--1428.
[19]
Qun Huang, Haifeng Sun, Patrick PC Lee, Wei Bai, Feng Zhu, and Yungang Bao. 2020. Omnimon: Re-architecting network telemetry with resource efficiency and full accuracy. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 404--421.
[20]
Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, and Raman Arora. 2019. Communication-efficient distributed SGD with sketching. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 13142--13152.
[21]
Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. 2018. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data. 1269--1284.
[22]
Lu Jie, Chen Hongchang, Sun Penghao, Hu Tao, and Zhang Zhen. 2021. OrderSketch:: An Unbiased and Fast Sketch for Frequency Estimation of Data Streams. Computer Networks, Vol. 201 (2021), 108563.
[23]
Brian W Kernighan and Dennis M Ritchie. 1978. The C Programming Language, Prentice Hall. Englewood Cliffs, New Jersey (1978).
[24]
Haoyu Li, Qizhi Chen, Yixin Zhang, Tong Yang, and Bin Cui. 2022. Stingy sketch: a sketch framework for accurate and fast frequency estimation. Proceedings of the VLDB Endowment, Vol. 15, 7 (2022), 1426--1438.
[25]
Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. Wavingsketch: An unbiased and generic sketch for finding top-k items in data streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1574--1584.
[26]
Xin Li, Fang Bian, Mark Crovella, Christophe Diot, Ramesh Govindan, Gianluca Iannaccone, and Anukool Lakhina. 2006. Detection and identification of network anomalies using sketch subspaces. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement. 147--152.
[27]
Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases. 346--357.
[28]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th international conference on Database Theory. 398--412.
[29]
Cheng-Chieh Peng, Kuo-Shiang Hsu, and Pi-Chung Wang. 2021. Collaborative Traffic Measurement Using Sketches for Software Defined Networks. In 2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT). IEEE, 81--87.
[30]
Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. 2020. FetchSGD: communication-efficient federated learning with sketching. In Proceedings of the 37th International Conference on Machine Learning. 8253--8265.
[31]
Pratanu Roy, Arijit Khan, and Gustavo Alonso. 2016. Augmented sketch: Faster and more accurate stream processing. In Proceedings of the 2016 International Conference on Management of Data. 1449--1463.
[32]
Qilong Shi, Yuchen Xu, Jiuhua Qi, Wenjun Li, Tong Yang, Yang Xu, and Yi Wang. 2023. Cuckoo Counter: Adaptive Structure of Counters for Accurate Frequency and Top-k Estimation. IEEE/ACM Transactions on Networking (2023).
[33]
Aravind Srinivasan. 1999. Approximation algorithms via randomized rounding: a survey. Series in Advanced Topics in Mathematics, Polish Scientific Publishers PWN (1999), 9--71.
[34]
Lu Tang, Qun Huang, and Patrick PC Lee. 2019. MV-Sketch: A Fast and Compact Invertible Sketch for Heavy Flow Detection in Network Data Streams. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 2026--2034.
[35]
Daniel Ting. 2018. Data sketches for disaggregated subset sum and frequent item estimation. In Proceedings of the 2018 International Conference on Management of Data. 1129--1140.
[36]
Daniel Ting, Jonathan Malkin, and Lee Rhodes. 2020. Data sketching for real time analytics: Theory and practice. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3567--3568.
[37]
Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John CS Lui, and Xiaohong Guan. 2019. A memory-efficient sketch method for estimating high similarities in streaming sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 25--33.
[38]
Hongli Xu, Shigang Chen, Qianpiao Ma, and Liusheng Huang. 2019. Lightweight Flow Distribution for Collaborative Traffic Measurement in Software Defined Networks. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 1108--1116.
[39]
Tong Yang, Siang Gao, Zhouyi Sun, Yufei Wang, Yulong Shen, and Xiaoming Li. 2019. Diamond sketch: Accurate per-flow measurement for big streaming data. IEEE Transactions on Parallel and Distributed Systems, Vol. 30, 12 (2019), 2650--2662.
[40]
Tong Yang, Junzhi Gong, Haowei Zhang, Lei Zou, Lei Shi, and Xiaoming Li. 2018a. HeavyGuardian: Separate and guard hot items in data streams. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2584--2593.
[41]
Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018b. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 561--575.
[42]
Tong Yang, Yang Zhou, Hao Jin, Shigang Chen, and Xiaoming Li. 2017. Pyramid sketch: A sketch framework for frequency estimation of data streams. Proceedings of the VLDB Endowment, Vol. 10, 11 (2017), 1442--1453.
[43]
Shanshan Ying, Flip Korn, Barna Saha, and Divesh Srivastava. 2015. Treescope: finding structural anomalies in semi-structured data. Proceedings of the VLDB Endowment, Vol. 8, 12 (2015), 1904--1907.
[44]
Xiwen Yu, Hongli Xu, Da Yao, Haibo Wang, and Liusheng Huang. 2018. CountMax: A Lightweight and Cooperative Sketch Measurement for Software-Defined Networks. IEEE/ACM Transactions on Networking (TON), Vol. 26, 6 (2018), 2774--2786.
[45]
Yutong Zhai, Hongli Xu, Haibo Wang, Zeyu Meng, and He Huang. 2020. Joint Routing and Sketch Configuration in Software-Defined Networking. IEEE/ACM Transactions on Networking, Vol. 28, 5 (2020), 2092--2105.
[46]
Yinda Zhang, Jinyang Li, Yutian Lei, Tong Yang, Zhetao Li, Gong Zhang, and Bin Cui. 2020. On-off sketch: A fast and accurate sketch on persistence. Proceedings of the VLDB Endowment, Vol. 14, 2 (2020), 128--140.
[47]
Bohan Zhao, Xiang Li, Boyu Tian, Zhiyu Mei, and Wenfei Wu. 2021a. DHS: Adaptive Memory Layout Organization of Sketch Slots for Fast and Accurate Data Stream Processing. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2285--2293.
[48]
Yikai Zhao, Zheng Zhong, Yuanpeng Li, Yi Zhou, Yifan Zhu, Li Chen, Yi Wang, and Tong Yang. 2021b. Cluster-Reduce: Compressing Sketches for Distributed Data Streams. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2316--2326.
[49]
Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig. 2018. Cold filter: A meta-framework for faster and more accurate stream processing. In Proceedings of the 2018 International Conference on Management of Data. 741--756.
[50]
You Zhou, Youlin Zhang, Chaoyi Ma, Shigang Chen, and Olufemi O Odegbile. 2019. Generalized sketch families for network traffic measurement. Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 3, 3 (2019), 1--34.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN:9798400701030
DOI:10.1145/3580305
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 August 2023

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. distributed data stream mining
  2. scheduling policy
  3. unbiased sketch

Qualifiers

  • Research-article

Conference

KDD '23
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 267
    Total Downloads
  • Downloads (Last 12 months)186
  • Downloads (Last 6 weeks)17
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media