Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams

Published: 10 May 2021 Publication History

Abstract

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be magnitudes smaller than the number of edges.
Our methods address the challenging task of counting sparse motifs—sub-graph patterns—that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate.
While we focus on the designing and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrence of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest.
We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.

References

[1]
Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014. Graph sample and hold: A framework for big-graph analytics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1446--1455.
[2]
Nesreen K. Ahmed, Nick Duffield, Theodore Willke, and Ryan A. Rossi. 2017. On sampling from massive graph streams. Proceedings of the VLDB Endowment 10, 11 (2017).
[3]
Nesreen K. Ahmed, Nick Duffield, and Liangzhen Xia. 2018. Sampling for approximate bipartite network projection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3286--3292.
[4]
Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, and Nick Duffield. 2015. Efficient graphlet counting for large networks. In Proceedings of the 2015 IEEE International Conference on Data Mining. IEEE, 1--10.
[5]
Réka Albert and Albert-László Barabási. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 1 (2002), 47.
[6]
Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. 2010. Efficient algorithms for large-scale local triangle counting. ACM Transactions on Knowledge Discovery from Data 4, 3 (2010), 1--28.
[7]
Jonathan W. Berry, Bruce Hendrickson, Randall A. LaViolette, and Cynthia A. Phillips. 2011. Tolerating the community detection resolution limit with edge weighting. Physical Review E 83, 5 (2011), 056119.
[8]
Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web. 587--596.
[9]
Ilaria Bordino, Debora Donato, Aristides Gionis, and Stefano Leonardi. 2008. Mining large networks with subgraph counting. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE, 737--742.
[10]
Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2018. Motif counting beyond five nodes. ACM Transactions on Knowledge Discovery from Data 12, 4 (2018), 1--25.
[11]
Marco Bressan, Stefano Leucci, and Alessandro Panconesi. 2019. Motivo: Fast motif counting via succinct color coding and adaptive sampling. Proceedings of the VLDB Endowment 12, 11 (2019), 1651--1663.
[12]
Lorenzo De Stefani, Erisa Terolli, and Eli Upfal. 2017. Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data ’17). IEEE, 776--786.
[13]
Jean-Pierre Eckmann and Elisha Moses. 2002. Curvature of co-links uncovers hidden thematic layers in the world wide web. Proceedings of the National Academy of Sciences 99, 9 (2002), 5825--5829.
[14]
Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra. 2018. Spotlight: Detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1378--1386.
[15]
Irene Finocchi, Marco Finocchi, and Emanuele G. Fusco. 2015. Clique counting in MapReduce: Algorithms and experiments. Journal of Experimental Algorithmics 20 (2015), 1--20.
[16]
Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (2006), 679--688.
[17]
Shweta Jain and C. Seshadhri. 2017. A fast and provable method for estimating clique counts using Turán’s theorem. In Proceedings of the 26th International Conference on World Wide Web. 441--449.
[18]
Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In Proceedings of the 24th International Conference on World Wide Web. 495--505.
[19]
Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. ACM Transactions on Knowledge Discovery from Data 9, 3 (2015), 1--21.
[20]
Konstantin Kutzkov and Rasmus Pagh. 2014. Triangle counting in dynamic graph streams. In Proceedings of the Scandinavian Workshop on Algorithm Theory. Springer, 306--318.
[21]
David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58, 7 (2007), 1019--1031.
[22]
Yongsub Lim and U. Kang. 2015. Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 685--694.
[23]
Paul Liu, Austin R. Benson, and Moses Charikar. 2019. Sampling methods for counting temporal motifs. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining. 294--302.
[24]
Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: Simple building blocks of complex networks. Science 298, 5594 (2002), 824--827.
[25]
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 29--42.
[26]
Michael Mitzenmacher and Eli Upfal. 2017. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press.
[27]
Rasmus Pagh and Charalampos E. Tsourakakis. 2012. Colorful triangle counting and a MapReduce implementation. Information Processing Letters 112, 7 (2012), 277--281.
[28]
Kirill Paramonov, Dmitry Shemetov, and James Sharpnack. 2019. Estimating graphlet statistics via lifting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 587--595.
[29]
Ha-Myung Park and Chin-Wan Chung. 2013. An efficient MapReduce algorithm for counting triangles in a very large graph. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 539--548.
[30]
A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. 2013. Counting and Sampling Triangles from a Graph Stream. In Proceedings of the International Conference on Very Large Data Bases. 1870--1881.
[31]
Mahmudur Rahman, Mansurul Alam Bhuiyan, and Mohammad Al Hasan. 2014. Graft: An efficient graphlet counting method for large graph analysis. IEEE Transactions on Knowledge and Data Engineering 26, 10 (2014), 2466--2478.
[32]
Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyuce, and Srikanta Tirthapura. 2018. Butterfly counting in bipartite networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2150--2159.
[33]
Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. 2017. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM Transactions on Knowledge Discovery from Data 11, 4 (2017), 1--50.
[34]
Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 1 (1985), 37--57.

Cited By

View all
  • (2024)Heterogeneous Network Motif Coding, Counting, and ProfilingACM Transactions on Knowledge Discovery from Data10.1145/368746518:9(1-21)Online publication date: 30-Oct-2024
  • (2023)Auxo: A Scalable and Efficient Graph Stream Summarization StructureProceedings of the VLDB Endowment10.14778/3583140.358315416:6(1386-1398)Online publication date: 1-Feb-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data
ACM Transactions on Knowledge Discovery from Data  Volume 15, Issue 5
October 2021
508 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3461317
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2021
Accepted: 01 December 2020
Revised: 01 March 2020
Received: 01 March 2019
Published in TKDD Volume 15, Issue 5

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Graph motif mining
  2. reservoir sampling
  3. stream computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)139
  • Downloads (Last 6 weeks)24
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Heterogeneous Network Motif Coding, Counting, and ProfilingACM Transactions on Knowledge Discovery from Data10.1145/368746518:9(1-21)Online publication date: 30-Oct-2024
  • (2023)Auxo: A Scalable and Efficient Graph Stream Summarization StructureProceedings of the VLDB Endowment10.14778/3583140.358315416:6(1386-1398)Online publication date: 1-Feb-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media