Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Enhancing Accuracy for Super Spreader Identification in High-Speed Data Streams

Published: 30 August 2024 Publication History

Abstract

This paper addresses the challenge of identifying super spreaders within large, high-speed data streams. In these streams, data is segmented into flows, with each flow's spread defined as the number of distinct items it contains. A super spreader is characterized as a flow with a notably large spread. Current compact solutions, known as sketches, are designed to fit within the constrained memory of online devices. However, they struggle to accurately track the spread of all flows due to the substantial memory requirement for monitoring a single flow --- a problem exacerbated when numerous flows are involved. To overcome these limitations, this study proposes a more precise sketch-based approach. Our solution introduces an innovative non-duplicate sampler that effectively eliminates duplicates, allowing for accurate post-sampling count of flow spread using only counters. Additionally, it incorporates an exponential-weakening decay technique to highlight large flows, markedly enhancing the accuracy of super spreader identification. We offer a comprehensive theoretical analysis of our method. Trace-driven experiments validate that our approach statistically surpasses existing state-of-the-art solutions in identifying super spreaders. It also demonstrates the lowest time required to restore super spreaders and significantly reduces bandwidth consumption by an order of magnitude when offline restoration is conducted remotely.

References

[1]
2016. Retailrocket Recommender System Dataset, last accessed date: July 10 2024. https://www.kaggle.com/retailrocket/ecommerce-dataset.
[2]
A. Akella, A. Bharambe, M. Reiter, and S. Seshan. 2003. Detecting DDoS Attacks on ISP Networks. In Proceedings of the Twenty-Second ACM SIGMOD/PODS Workshop on Management and Processing of Data Streams. 1--3.
[3]
J. Anton, L. Jacobs, X. Liu, J. Parker, Z. Zeng, and Zhong. 2002. Web Caching for Database Applications with Oracle Web Cache. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data. 594--599.
[4]
Shunji Aoyagi, Jong-Deok Kim, Kien Nguyen, and Hiroo Sekiya. 2023. An Extensive Evaluation of TCP Congestion Control in 10 Gbps Network. In 2023 24st Asia-Pacific Network Operations and Management Symposium (APNOMS). IEEE, 306--309.
[5]
Ran Ben Basat, Xiaoqi Chen, Gil Einziger, Shir Landau Feibish, Danny Raz, and Minlan Yu. 2020. Routing Oblivious Measurement Analytics. In 2020 IFIP Networking Conference (Networking). IEEE, 449--457.
[6]
Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2016. Heavy Hitters in Streams and Sliding Windows. In IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications. IEEE, 1--9.
[7]
Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo C Luizelli, and Erez Waisbard. 2017. Constant Time Updates in Hierarchical Heavy Hitters. Proc. of ACM SIGCOMM (2017).
[8]
A. Bronselaer, S. Debergh, D. Van Hyfte, and G. D. Tré. 2010. Estimation of Topic Cardinality in Document Collections. In SIAM Conference on Data Mining (SDM 10). SIAM, 31--39.
[9]
Jing Cao, Yu Jin, Aiyou Chen, Tian Bu, and Z-L Zhang. 2009. Identifying high cardinality internet hosts. In IEEE INFOCOM 2009. IEEE, 810--818.
[10]
Jiale Chen, Xingshu Chen, Liangguo Chen, Xiao Lan, and Yonggang Luo. 2023. ANTI: An Adaptive Network Traffic Indexing Algorithm for High-Speed Networks. In GLOBECOM 2023--2023 IEEE Global Communications Conference. IEEE, 1699--1704.
[11]
S. Chen and Y. Tang. 2004. Slowing Down Internet Worms. In 24th International Conference on Distributed Computing Systems, 2004. Proceedings. IEEE, 312--319.
[12]
Peter Cogan, Matthew Andrews, Milan Bradonjic, W Sean Kennedy, Alessandra Sala, and Gabriel Tucci. 2012. Reconstruction and analysis of twitter conversation graphs. In Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research. 25--31.
[13]
Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT press.
[14]
Graham Cormode and S. Muthukrishnan. 2004. An Improved Data Stream Summary: the Count-Min Sketch and Its Applications. Proc. of LATIN (2004).
[15]
Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58--75.
[16]
E. Demaine, A. Lopez-Ortiz, and J. Munro. 2002. Frequency Estimation of Internet Packet Streams with Limited Space. Proc. of 10th ESA Annual European Symposium on Algorithms (September 2002).
[17]
Yang Du, He Huang, Yu-E Sun, Shigang Chen, and Guoju Gao. 2021. Self-adaptive sampling for network traffic measurement. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications. IEEE, 1--10.
[18]
M. Durand and P. Flajolet. 2003. Loglog Counting of Large Cardinalities. In European Symposium on Algorithms. Springer, 605--617.
[19]
Z. Durumeric, M. Bailey, and J A. Halderman. 2014. An Internet-wide View of Internet-wide Scanning. In 23rd {USENIX} Security Symposium ({USENIX} Security 14). 65--78.
[20]
Gil Einziger and Roy Friedman. 2019. Counting with Tinytable: Every Bit Counts! IEEE Access (2019).
[21]
Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. 153--166.
[22]
C. Estan, G. Varghese, and M. Fisk. 2003. Bitmap Algorithms for Counting Active Flows on High Speed Links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. 153--166.
[23]
C. Estan, G. Varghese, and M. Fisk. 2006. Bitmap Algorithms for Counting Active Flows on High-speed Links. IEEE/ACM Transactions on Networking 14, 5 (2006), 925--937.
[24]
P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. 2007. Hyperloglog: the Analysis of a Near-optimal Cardinality Estimation Algorithm. In Discrete Mathematics and Theoretical Computer Science. Discrete Mathematics and Theoretical Computer Science, 137--156.
[25]
P. Flajolet and G N. Martin. 1985. Probabilistic Counting Algorithms for Data Base Applications. Journal of computer and system sciences 31, 2 (1985), 182--209.
[26]
S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani. 2007. Streaming Algorithms for Robust, Real-Time Detection of DDoS Attacks. In 27th International Conference on Distributed Computing Systems (ICDCS '07). 4--4.
[27]
Eimantas Garsva, Nerijus Paulauskas, and Gediminas Grazulevicius. 2015. Packet size distribution tendencies in computer network flows. In 2015 Open Conference of Electrical, Electronic and Information Sciences (eStream). IEEE, 1--6.
[28]
J. Gong, T. Yang, H. Zhang, H. Li, S. Uhlig, S. Chen, L. Uden, and X. Li. 2018. HeavyKeeper: An Accurate Algorithm for Finding Top-k Elephant Flows. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 909--921. https://www.usenix.org/conference/atc18/presentation/gong
[29]
A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. 2012. Processing a Trillion Cells per Mouse Click. Proceedings of the VLDB Endowment 5, 11 (2012), 1436--1446.
[30]
S. Heule, M. Nunkesser, and A. Hall. 2013. HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 683--692.
[31]
Q. Huang, X. Jin, P. P. Lee, R. Li, L. Tang, Y. Chen, and G. Zhang. 2017. Sketchvisor: Robust Network Measurement for Software Packet Processing. In Proceedings of the 2017 ACM SIGCOMM Conference. 113--126.
[32]
Q. Huang, P. P. C. Lee, and Y. Bao. 2018. SketchLearn: Relieving User Burdens in Approximate Measurement with Automated Statistical Inference. Proc. of ACM SIGCOMM (August 2018), 576--590.
[33]
Nicole Immorlica, Kamal Jain, Mohammad Mahdian, and Kunal Talwar. 2005. Click fraud resistant methods for learning click-through rates. In Internet and Network Economics: First International Workshop, WINE 2005, Hong Kong, China, December 15--17, 2005. Proceedings 1. Springer, 34--45.
[34]
Xuyang Jing, Zheng Yan, Hui Han, and Witold Pedrycz. 2021. ExtendedSketch: Fusing network traffic for super host identification with a memory efficient sketch. IEEE Transactions on Dependable and Secure Computing 19, 6 (2021), 3913--3924.
[35]
Noriaki Kamiyama, Tatsuya Mori, and Ryoichi Kawahara. 2007. Simple and adaptive identification of superspreaders by flow sampling. In IEEE INFOCOM 2007-26th IEEE International Conference on Computer Communications. IEEE, 2481--2485.
[36]
Raymond YK Lau, SY Liao, Ron Chi-Wai Kwok, Kaiquan Xu, Yunqing Xia, and Yuefeng Li. 2012. Text mining and probabilistic language modeling for online review spam detection. ACM Transactions on Management Information Systems (TMIS) 2, 4 (2012), 1--30.
[37]
Xiang Li, Chao Wang, Jiwei Tan, Xiaoyi Zeng, Dan Ou, Dan Ou, and Bo Zheng. 2020. Adversarial multimodal representation learning for click-through rate prediction. In Proceedings of The Web Conference 2020. 827--836.
[38]
Y. Li, R. Miao, C. Kim, and M. Yu. 2016. Flowradar: A Better Netflow for Data Centers. In 13th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 16). 311--324.
[39]
Weijiang Liu, Wenyu Qu, Jian Gong, and Keqiu Li. 2015. Detection of superpoints using a vector bloom filter. IEEE Transactions on Information Forensics and Security 11, 3 (2015), 514--527.
[40]
Y. Liu, W. Chen, and Y. Guan. 2016. Identifying High-Cardinality Hosts from Network-Wide Traffic Measurements. IEEE Transactions on Dependable and Secure Computing 13, 5 (Sep. 2016), 547--558.
[41]
Z. Liu, R. Ben-Basat, G. Einziger, Y. Kassner, V. Braverman, R. Friedman, and V. Sekar. 2019. Nitrosketch: Robust and General Sketch-based Monitoring in Software Switches. In Proceedings of the 2019 ACM SIGCOMM Conference. 334--350.
[42]
Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman. 2016. One Sketch to Rule Them All: Rethinking Network Flow Monitoring with Univmon. In Proceedings of the 2016 ACM SIGCOMM Conference. 101--114.
[43]
Zaoxing Liu, Hun Namkung, Georgios Nikolaidis, Jeongkeun Lee, Changhoon Kim, Xin Jin, Vladimir Braverman, Minlan Yu, and Vyas Sekar. 2021. Jaqen: A {High-Performance}{Switch-Native} approach for detecting and mitigating volumetric {DDoS} attacks with programmable switches. In 30th USENIX Security Symposium (USENIX Security 21). 3829--3846.
[44]
Chaoyi Ma, Shigang Chen, Youlin Zhang, Qingjun Xiao, and Olufemi O Odegbile. 2021. Super spreader identification using geometric-min filter. IEEE/ACM Transactions on Networking 30, 1 (2021), 299--312.
[45]
Chaoyi Ma, Olufemi O Odegbile, Dimitrios Melissourgos, Haibo Wang, and Shiping Chen. 2023. From CountMin to Super kJoin Sketches for Flow Spread Estimation. IEEE Transactions on Network Science and Engineering (2023).
[46]
Ankush Mandal, He Jiang, Anshumali Shrivastava, and Vivek Sarkar. 2018. Topkapi: parallel and fast sketches for finding top-k frequent elements. Advances in Neural Information Processing Systems 31 (2018).
[47]
G. Manku and R. Motwani. 2002. Approximate Frequency Counts over Data Streams. Proc. of VLDB (August 2002).
[48]
Dimitrios Melissourgos, Haibo Wang, Shigang Chen, Chaoyi Ma, and Shiping Chen. 2023. Single Update Sketch with Variable Counter Structure. Proceedings of the VLDB Endowment 16, 13 (2023), 4296--4309.
[49]
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. 2010. Dremel: Interactive Analysis of Web-scale Datasets. Proceedings of the VLDB Endowment 3, 1--2 (2010), 330--339.
[50]
A. Metwally, D. Agrawal, and A. E. Abbadi. 2005. Efficient Computation of Frequent and Top-k Elements in Data Streams. Proc. of 10th International Conference on Database Theory (ICDT) (January 2005).
[51]
J. Mirkovic and P. Reiher. 2004. A Taxonomy of DDoS Attack and DDoS Defense Mechanisms. ACM SIGCOMM Computer Communication Review 34, 2 (2004), 39--53.
[52]
Duc-Minh Ngo, Cuong Pham-Quoc, and Tran Ngoc Thinh. 2018. An efficient high-throughput and low-latency syn flood defender for high-speed networks. Security and Communication Networks 2018 (2018), 1--14.
[53]
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. 2005. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming 13, 4 (2005), 277--298.
[54]
David Plonka. 2000. FlowScan: A Network Traffic Flow Reporting and Visualization Tool. In LISA. 305--317.
[55]
Martin Roesch et al. 1999. Snort: Lightweight intrusion detection for networks. In Lisa, Vol. 99. 229--238.
[56]
P. Roy, A. Khan, and G. Alonso. 2016. Augmented Sketch: Faster and More Accurate Stream Processing. In Proceedings of the 2016 International Conference on Management of Data. 1449--1463.
[57]
S. Sen and J. Wang. 2002. Analyzing Peer-to-peer Traffic Across Large Networks. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment. ACM, 137--150.
[58]
S. Singh, C. Estan, G. Varghese, and S. Savage. 2004. Automated Worm Fingerprinting. In OSDI, Vol. 4. 4--4.
[59]
A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller. 2010. An Overview of IP Flow-based Intrusion Detection. IEEE communications surveys & tutorials 12, 3 (2010), 343--356.
[60]
L. Tang, Q. Huang, and P. Lee. 2020. SpreadSketch: Toward Invertible and Network-Wide Detection of Superspreaders. (2020).
[61]
D. Ting. 2014. Streamed Approximate Counting of Distinct Elements: Beating Optimal Batch Methods. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 442--451.
[62]
UCSD. 2015. CAIDA UCSD Anonymized 2015 Internet Traces on Jan. 17, last accessed date: July 10 2024,. https://www.caida.org/data/passive/passive_2015_dataset.xml.
[63]
M. Vartak, V. Raghavan, and E. Rundensteiner. 2010. Qrelx: Generating Meaningful Queries That Provide Cardinality Assurance. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 1215--1218.
[64]
Ss Venkataraman, Ds Song, Ps B Gibbons, and As Blum. 2005. New Streaming Algorithms for Fast Detection of Superspreaders. Technical Report.
[65]
Haibo Wang, Chaoyi Ma, Shigang Chen, and Yuanda Wang. 2022. Fast and accurate cardinality estimation by self-morphing bitmaps. IEEE/ACM Transactions on Networking 30, 4 (2022), 1674--1688.
[66]
Haibo Wang, Chaoyi Ma, Shigang Chen, and Yuanda Wang. 2022. Online Cardinality Estimation by Self-morphing Bitmaps. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1--13.
[67]
Haibo Wang, Chaoyi Ma, Olufemi O Odegbile, Shigang Chen, and Jih-Kwon Peir. 2021. Randomized Error Removal for Online Spread Estimation in Data Streaming. Proceedings of the VLDB Endowment 14, 6 (2021), 1040--1052.
[68]
Haibo Wang, Chaoyi Ma, Olufemi O Odegbile, Shigang Chen, and Jih-Kwon Peir. 2022. Randomized error removal for online spread estimation in high-speed networks. IEEE/ACM Transactions on Networking (2022).
[69]
Haibo Wang, Dimitrios Melissourgos, Chaoyi Ma, and Shigang Chen. 2023. Real-time Spread Burst Detection in Data Streaming. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7, 2 (2023), 1--31.
[70]
H. Wang, H. Xu, L. Huang, and Y. Zhai. 2020. Fast and Accurate Traffic Measurement with Hierarchical Filtering. IEEE Transactions on Parallel and Distributed Systems 31, 10 (2020), 2360--2374.
[71]
L. Wang, T. Yang, H. Wang, J. Jiang, Z. Cai, B. Cui, and X. Li. 2019. Fine-grained Probability Counting for Cardinality Estimation of Data Streams. World Wide Web 22, 5 (2019), 2065--2081.
[72]
Pinghui Wang, Xiaohong Guan, Tao Qin, and Qiuzhen Huang. 2011. A data streaming method for monitoring host connection degrees of high-speed links. IEEE Transactions on Information Forensics and Security 6, 3 (2011), 1086--1098.
[73]
Pinghui Wang, Peng Jia, Xiangliang Zhang, Jing Tao, Xiaohong Guan, and Don Towsley. 2019. Utilizing dynamic properties of sharing bits and registers to estimate user cardinalities over time. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1094--1105.
[74]
Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. 2010. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining. 261--270.
[75]
K. Whang, B. T Vander-Zanden, and H. M Taylor. 1990. A Linear-time Probabilistic Counting Algorithm for Database Applications. ACM Transactions on Database Systems (TODS) 15, 2 (1990), 208--229.
[76]
Qingjun Xiao, Shigang Chen, Min Chen, and Yibei Ling. 2015. Hyper-compact virtual estimators for big network data based on register sharing. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 417--428.
[77]
Qingjun Xiao, Shigang Chen, You Zhou, Min Chen, Junzhou Luo, Tengli Li, and Yibei Ling. 2017. Cardinality estimation for elephant flows: A compact solution based on virtual register sharing. IEEE/ACM Transactions on Networking 25, 6 (2017), 3738--3752.
[78]
T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig. 2018. Elastic Sketch: Adaptive and Fast Network-wide Measurements. In Proceedings of the 2018 ACM SIGCOMM Conference. 561--575.
[79]
M. Yoon, T. Li, S. Chen, and J. Peir. 2009. Fit a Spread Estimator in Small Memory. In IEEE INFOCOM 2009. IEEE, 504--512.
[80]
M Yoon, Tao Li, Shigang Chen, and J-K Peir. 2009. Fit a spread estimator in small memory. In IEEE INFOCOM 2009. IEEE, 504--512.
[81]
MyungKeun Yoon, Tao Li, Shigang Chen, and Jih-Kwon Peir. 2010. Fit a compact spread estimator in small high-speed memory. IEEE/ACM Transactions on Networking 19, 5 (2010), 1253--1264.
[82]
M. Yu, L. Jose, and R. Miao. 2013. Software Defined Traffic Measurement with OpenSketch. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). 29--42.
[83]
X. Yu, H. Xu, D. Yao, H. Wang, and L. Huang. 2018. Countmax: A Lightweight and Cooperative Sketch Measurement for Software-defined Networks. IEEE/ACM Transactions on Networking 26, 6 (2018), 2774--2786.
[84]
Y. Zhou, T. Yang, J. Jiang, B. Cui, M. Yu, X. Li, and S. Uhlig. 2018. Cold Filter: A Meta-framework for Faster and More Accurate Stream Processing. In Proceedings of the 2018 International Conference on Management of Data. 741--756.
[85]
Y. Zhou, Y. Zhang, C. Ma, S. Chen, and O. O Odegbile. 2019. Generalized Sketch Families for Network Traffic Measurement. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, 3 (2019), 1--34.

Index Terms

  1. Enhancing Accuracy for Super Spreader Identification in High-Speed Data Streams
            Index terms have been assigned to the content through auto-classification.

            Recommendations

            Comments

            Please enable JavaScript to view thecomments powered by Disqus.

            Information & Contributors

            Information

            Published In

            cover image Proceedings of the VLDB Endowment
            Proceedings of the VLDB Endowment  Volume 17, Issue 11
            July 2024
            1039 pages
            Issue’s Table of Contents

            Publisher

            VLDB Endowment

            Publication History

            Published: 30 August 2024
            Published in PVLDB Volume 17, Issue 11

            Check for updates

            Badges

            Qualifiers

            • Research-article

            Contributors

            Other Metrics

            Bibliometrics & Citations

            Bibliometrics

            Article Metrics

            • 0
              Total Citations
            • 18
              Total Downloads
            • Downloads (Last 12 months)18
            • Downloads (Last 6 weeks)13
            Reflects downloads up to 27 Nov 2024

            Other Metrics

            Citations

            View Options

            Login options

            Full Access

            View options

            PDF

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader

            Media

            Figures

            Other

            Tables

            Share

            Share

            Share this Publication link

            Share on social media