Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Fast and Accurate Cardinality Estimation by Self-Morphing Bitmaps

Published: 10 February 2022 Publication History

Abstract

Estimating the cardinality of a data stream is a fundamental problem underlying numerous applications such as traffic monitoring in a network or a datacenter and query optimization of Internet-scale P2P data networks. Existing solutions suffer from high processing/query overhead or memory in-efficiency, which prevents them from operating online for data streams with very high arrival rates. This paper takes a new solution path different from the prior art and proposes a self-morphing bitmap, which combines operational simplicity with structural dynamics, allowing the bitmap to be morphed in a series of steps with an evolving sampling probability that automatically adapts to different stream sizes. We further generalize the design of self-morphing bitmap. We evaluate the self-morphing bitmap theoretically and experimentally. The results demonstrate that it significantly outperforms the prior art.

References

[1]
P. Flajolet and G. N. Martin, “Probabilistic counting algorithms for data base applications,” J. Comput. Syst. Sci., vol. 31, no. 2, pp. 182–209, Oct. 1985.
[2]
K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, “A linear-time probabilistic counting algorithm for database applications,” ACM Trans. Database Syst., vol. 15, no. 2, pp. 208–229, Jun. 1990.
[3]
C. Estan, G. Varghese, and M. Fisk, “Bitmap algorithms for counting active flows on high speed links,” in Proc. 3rd ACM SIGCOMM Conf. Internet Meas. (IMC), 2003, pp. 153–166.
[4]
C. Estan, G. Varghese, and M. Fisk, “Bitmap algorithms for counting active flows on high-speed links,” IEEE/ACM Trans. Netw., vol. 14, no. 5, pp. 925–937, Oct. 2006.
[5]
C. Qian, H. Ngan, Y. Liu, and L. M. Ni, “Cardinality estimation for large-scale RFID systems,” IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 9, pp. 1441–1454, Sep. 2011.
[6]
L. Wanget al., “Fine-grained probability counting for cardinality estimation of data streams,” World Wide Web, vol. 22, no. 5, pp. 2065–2081, Sep. 2019.
[7]
Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, “One sketch to rule them all: Rethinking network flow monitoring with UnivMon,” in Proc. ACM SIGCOMM Conf., 2016, pp. 101–114.
[8]
J. Yanget al., “Elastic sketch: Adaptive and fast network-wide measurements,” in Proc. Conf. ACM Special Interest Group Data Commun., 2018, pp. 561–575.
[9]
C. Ma, H. Wang, O. O. Odegbile, and S. Chen, “Virtual filter for non-duplicate sampling,” in Proc. IEEE 29th Int. Conf. Netw. Protocols (ICNP), Nov. 2021, pp. 1–10.
[10]
Y. Zhou, Y. Zhang, C. Ma, S. Chen, and O. O. Odegbile, “Generalized sketch families for network traffic measurement,” Proc. ACM Meas. Anal. Comput. Syst., vol. 3, no. 3, pp. 1–34, Dec. 2019.
[11]
Q. Xiao, S. Chen, M. Chen, and Y. Ling, “Hyper-compact virtual estimators for big network data based on register sharing,” in Proc. ACM SIGMETRICS Int. Conf. Meas. Modeling Comput. Syst., Jun. 2015, pp. 417–428.
[12]
H. Dai, M. Li, and A. Liu, “Finding persistent items in distributed datasets,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2018, pp. 1403–1411.
[13]
H. Dai, M. Li, A. X. Liu, J. Zheng, and G. Chen, “Finding persistent items in distributed datasets,” IEEE/ACM Trans. Netw., vol. 28, no. 1, pp. 1–14, Feb. 2020.
[14]
C. Ma, H. Wang, O. Odegbile, and S. Chen, “Noise measurement and removal for data streaming algorithms with network applications,” in Proc. IFIP Netw. Conf. (IFIP Networking), Jun. 2021, pp. 1–9.
[15]
C. Ma, S. Chen, Y. Zhang, Q. Xiao, and O. O. Odegbile, “Super spreader identification using geometric-min filter,” IEEE/ACM Trans. Netw., early access, Aug. 31, 2021. 10.1109/TNET.2021.3108033.
[16]
S. Nath, P. B. Gibbons, S. Seshan, and Z. Anderson, “Synopsis diffusion for robust aggregation in sensor networks,” ACM Trans. Sensor Netw., vol. 4, no. 2, pp. 1–40, Mar. 2008.
[17]
N. Ntarmos, P. Triantafillou, and G. Weikum, “Counting at large: Efficient cardinality estimation in internet-scale data networks,” in Proc. 22nd Int. Conf. Data Eng. (ICDE), Apr. 2006, p. 40.
[18]
S. Ganguly, M. Garofalakis, R. Rastogi, and K. Sabnani, “Streaming algorithms for robust, real-time detection of DDoS attacks,” in Proc. 27th Int. Conf. Distrib. Comput. Syst. (ICDCS), 2007, p. 4.
[19]
N. Zhaoet al., “Automatically and adaptively identifying severe alerts for online service systems,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Jul. 2020, pp. 2420–2429.
[20]
S. Heule, M. Nunkesser, and A. Hall, “Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm,” in Proc. 16th Int. Conf. Extending Database Technol., 2013, pp. 683–692.
[21]
P. Rob, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with Sawzall,” Sci. Program., vol. 13, no. 4, pp. 277–298, 2005.
[22]
S. Melniket al., “Dremel: Interactive analysis of web-scale datasets,” Proc. VLDB Endowment, vol. 3, nos. 1–2, pp. 330–339, Sep. 2010.
[23]
A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser, “Processing a trillion cells per mouse click,” Proc. VLDB Endowment, vol. 5, no. 11, pp. 1436–1446, Jul. 2012.
[24]
Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan, “Counting distinct elements in a data stream,” in Proc. Int. Workshop Randomization Approximation Techn. Comput. Sci. Berlin, Germany: Springer, 2002, pp. 1–10.
[25]
K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla, “On synopses for distinct-value estimation under multiset operations,” in Proc. ACM SIGMOD Int. Conf. Manage. Data (SIGMOD), 2007, pp. 199–210.
[26]
M. Durand and P. Flajolet, “Loglog counting of large cardinalities,” in Proc. Eur. Symp. Algorithms. Berlin, Germany: Springer, 2003, pp. 605–617.
[27]
P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier, “Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm,” in Proc. Int. Conf. Anal. Algorithms, Nancy, France, 2007, pp. 137–156.
[28]
H. Harmouch and F. Naumann, “Cardinality estimation: An experimental survey,” Proc. VLDB Endowment, vol. 11, no. 4, pp. 499–512, Dec. 2017.
[29]
A. Metwally, D. Agrawal, and A. E. Abbadi, “Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic,” in Proc. 11th Int. Conf. Extending Database Technol., Adv. Database Technol., 2008, pp. 618–629.
[30]
D. M. Kane, J. Nelson, and D. P. Woodruff, “An optimal algorithm for the distinct elements problem,” in Proc. 29th ACM SIGMOD-SIGACT-SIGART Symp. Princ. Database Syst. Data (PODS), 2010, pp. 41–52.
[31]
Q. Xiao, Y. Zhou, and S. Chen, “Better with fewer bits: Improving the performance of cardinality estimation of large data streams,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), May 2017, pp. 1–9.
[32]
Q. Xiao, S. Chen, Y. Zhou, and J. Luo, “Estimating cardinality for arbitrarily large data stream with improved memory efficiency,” IEEE/ACM Trans. Netw., vol. 28, no. 2, pp. 433–446, Apr. 2020.
[33]
M. Yu, L. Jose, and R. Miao, “Software defined traffic measurement with opensketch,” Presented at the 10th USENIX Symp. Networked Syst. Design Implement. (NSDI), 2013.
[34]
L. Tang, Q. Huang, and P. Lee, “SpreadSketch: Toward invertible and network-wide detection of superspreaders,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Toronto, ON, Canada, 2020, pp. 1608–1617.
[35]
H. Wang, C. Ma, O. O. Odegbile, S. Chen, and J.-K. Peir, “Randomized error removal for online spread estimation in data streaming,” Proc. VLDB Endowment, vol. 14, no. 6, pp. 1040–1052, Feb. 2021.
[36]
R. B. Basat, X. Chen, G. Einziger, S. L. Feibish, D. Raz, and M. Yu, “Routing oblivious measurement analytics,” in Proc. IFIP Netw. Conf. (Networking), Jun. 2020, pp. 449–457.
[37]
M. Yoon, T. Li, S. Chen, and J.-K. Peir, “Fit a compact spread estimator in small high-speed memory,” IEEE/ACM Trans. Netw., vol. 19, no. 5, pp. 1253–1264, Oct. 2011.
[38]
M. Goemans. (2021). Harmonic Number, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Harmonic_number
[39]
S. Janson, “Tail bounds for sums of geometric and exponential variables,” Statist. Probab. Lett., vol. 135, pp. 1–6, Apr. 2018.
[40]
UCSD. (2015). Caida UCSD Anonymized 2015 Internet Traces on Jan. 17. [Online]. Available: https://www.caida.org/data/passive/passive_2015_dataset.xml

Cited By

View all
  • (2024)Enhancing Accuracy for Super Spreader Identification in High-Speed Data StreamsProceedings of the VLDB Endowment10.14778/3681954.368198817:11(3124-3137)Online publication date: 30-Aug-2024
  • (2024)Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost EstimationProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679632(581-590)Online publication date: 21-Oct-2024
  • (2023)Real-time Spread Burst Detection in Data StreamingProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/35899797:2(1-31)Online publication date: 22-May-2023
  • Show More Cited By

Index Terms

  1. Fast and Accurate Cardinality Estimation by Self-Morphing Bitmaps
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

        cover image IEEE/ACM Transactions on Networking
        IEEE/ACM Transactions on Networking  Volume 30, Issue 4
        Aug. 2022
        471 pages

        Publisher

        IEEE Press

        Publication History

        Published: 10 February 2022
        Published in TON Volume 30, Issue 4

        Qualifiers

        • Research-article

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)12
        • Downloads (Last 6 weeks)0
        Reflects downloads up to 27 Nov 2024

        Other Metrics

        Citations

        Cited By

        View all
        • (2024)Enhancing Accuracy for Super Spreader Identification in High-Speed Data StreamsProceedings of the VLDB Endowment10.14778/3681954.368198817:11(3124-3137)Online publication date: 30-Aug-2024
        • (2024)Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost EstimationProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679632(581-590)Online publication date: 21-Oct-2024
        • (2023)Real-time Spread Burst Detection in Data StreamingProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/35899797:2(1-31)Online publication date: 22-May-2023
        • (2023)Randomized Error Removal for Online Spread Estimation in High-Speed NetworksIEEE/ACM Transactions on Networking10.1109/TNET.2022.319796831:2(558-573)Online publication date: 1-Apr-2023
        • (2023)An Adaptive Method for Identifying Super Nodes from Network-wide ViewJournal of Network and Systems Management10.1007/s10922-023-09745-031:3Online publication date: 9-Jun-2023

        View Options

        Login options

        Full Access

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media