research-article

Public Access

Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams

Authors:

Lorenzo De Stefani,

Eli UpfalAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 15, Issue 5

Article No.: 79, Pages 1 - 52

https://doi.org/10.1145/3441299

Published: 10 May 2021 Publication History

All formats PDF

Abstract

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be magnitudes smaller than the number of edges.

Our methods address the challenging task of counting sparse motifs—sub-graph patterns—that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate.

While we focus on the designing and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrence of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest.

We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.

References

[1]

Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014. Graph sample and hold: A framework for big-graph analytics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1446--1455.

Digital Library

[2]

Nesreen K. Ahmed, Nick Duffield, Theodore Willke, and Ryan A. Rossi. 2017. On sampling from massive graph streams. Proceedings of the VLDB Endowment 10, 11 (2017).

[3]

Nesreen K. Ahmed, Nick Duffield, and Liangzhen Xia. 2018. Sampling for approximate bipartite network projection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3286--3292.

[4]

Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, and Nick Duffield. 2015. Efficient graphlet counting for large networks. In Proceedings of the 2015 IEEE International Conference on Data Mining. IEEE, 1--10.

Digital Library

[5]

Réka Albert and Albert-László Barabási. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 1 (2002), 47.

[6]

Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. 2010. Efficient algorithms for large-scale local triangle counting. ACM Transactions on Knowledge Discovery from Data 4, 3 (2010), 1--28.

[7]

Jonathan W. Berry, Bruce Hendrickson, Randall A. LaViolette, and Cynthia A. Phillips. 2011. Tolerating the community detection resolution limit with edge weighting. Physical Review E 83, 5 (2011), 056119.

[8]

Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web. 587--596.

Digital Library

[9]

Ilaria Bordino, Debora Donato, Aristides Gionis, and Stefano Leonardi. 2008. Mining large networks with subgraph counting. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. IEEE, 737--742.

Digital Library

[10]

Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2018. Motif counting beyond five nodes. ACM Transactions on Knowledge Discovery from Data 12, 4 (2018), 1--25.

[11]

Marco Bressan, Stefano Leucci, and Alessandro Panconesi. 2019. Motivo: Fast motif counting via succinct color coding and adaptive sampling. Proceedings of the VLDB Endowment 12, 11 (2019), 1651--1663.

Digital Library

[12]

Lorenzo De Stefani, Erisa Terolli, and Eli Upfal. 2017. Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data ’17). IEEE, 776--786.

[13]

Jean-Pierre Eckmann and Elisha Moses. 2002. Curvature of co-links uncovers hidden thematic layers in the world wide web. Proceedings of the National Academy of Sciences 99, 9 (2002), 5825--5829.

[14]

Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra. 2018. Spotlight: Detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1378--1386.

Digital Library

[15]

Irene Finocchi, Marco Finocchi, and Emanuele G. Fusco. 2015. Clique counting in MapReduce: Algorithms and experiments. Journal of Experimental Algorithmics 20 (2015), 1--20.

Digital Library

[16]

Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (2006), 679--688.

[17]

Shweta Jain and C. Seshadhri. 2017. A fast and provable method for estimating clique counts using Turán’s theorem. In Proceedings of the 26th International Conference on World Wide Web. 441--449.

[18]

Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In Proceedings of the 24th International Conference on World Wide Web. 495--505.

Digital Library

[19]

Madhav Jha, C. Seshadhri, and Ali Pinar. 2015. A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. ACM Transactions on Knowledge Discovery from Data 9, 3 (2015), 1--21.

[20]

Konstantin Kutzkov and Rasmus Pagh. 2014. Triangle counting in dynamic graph streams. In Proceedings of the Scandinavian Workshop on Algorithm Theory. Springer, 306--318.

[21]

David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58, 7 (2007), 1019--1031.

Digital Library

[22]

Yongsub Lim and U. Kang. 2015. Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 685--694.

[23]

Paul Liu, Austin R. Benson, and Moses Charikar. 2019. Sampling methods for counting temporal motifs. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining. 294--302.

Digital Library

[24]

Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: Simple building blocks of complex networks. Science 298, 5594 (2002), 824--827.

[25]

Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 29--42.

Digital Library

[26]

Michael Mitzenmacher and Eli Upfal. 2017. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press.

[27]

Rasmus Pagh and Charalampos E. Tsourakakis. 2012. Colorful triangle counting and a MapReduce implementation. Information Processing Letters 112, 7 (2012), 277--281.

Digital Library

[28]

Kirill Paramonov, Dmitry Shemetov, and James Sharpnack. 2019. Estimating graphlet statistics via lifting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 587--595.

Digital Library

[29]

Ha-Myung Park and Chin-Wan Chung. 2013. An efficient MapReduce algorithm for counting triangles in a very large graph. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 539--548.

Digital Library

[30]

A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. 2013. Counting and Sampling Triangles from a Graph Stream. In Proceedings of the International Conference on Very Large Data Bases. 1870--1881.

Digital Library

[31]

Mahmudur Rahman, Mansurul Alam Bhuiyan, and Mohammad Al Hasan. 2014. Graft: An efficient graphlet counting method for large graph analysis. IEEE Transactions on Knowledge and Data Engineering 26, 10 (2014), 2466--2478.

[32]

Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyuce, and Srikanta Tirthapura. 2018. Butterfly counting in bipartite networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2150--2159.

Digital Library

[33]

Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. 2017. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM Transactions on Knowledge Discovery from Data 11, 4 (2017), 1--50.

Digital Library

[34]

Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 1 (1985), 37--57.

Digital Library

Cited By

Yu SXia FChen HLee IChi LTong H(2024)Heterogeneous Network Motif Coding, Counting, and ProfilingACM Transactions on Knowledge Discovery from Data10.1145/368746518:9(1-21)Online publication date: 30-Oct-2024
https://dl.acm.org/doi/10.1145/3687465
Jiang ZChen HJin H(2023)Auxo: A Scalable and Efficient Graph Stream Summarization StructureProceedings of the VLDB Endowment10.14778/3583140.358315416:6(1386-1398)Online publication date: 1-Feb-2023
https://dl.acm.org/doi/10.14778/3583140.3583154

Index Terms

Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams

Recommendations

Adaptive stratified reservoir sampling over heterogeneous data streams

Reservoir sampling is a known technique for maintaining a random sample of a fixed size over a data stream of an unknown size. While reservoir sampling is suitable for applications demanding a sample over the whole data stream, it is not designed for ...
Sampling from Large Graphs with a Reservoir
NBIS '14: Proceedings of the 2014 17th International Conference on Network-Based Information Systems

Sampling is a process of choosing a suitable representative subset from a population and uniformity is a basic requirement of representative ness. A sampling process produces a uniform random sample when all possible samples of the same size have the ...
Sampling streaming data with replacement

Simple random sampling is a widely accepted basis for estimation from a population. When data come as a stream, the total population size continuously grows and only one pass through the data is possible. Reservoir sampling is a method of maintaining a ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data

ACM Transactions on Knowledge Discovery from Data Volume 15, Issue 5

October 2021

508 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3461317

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Issue’s Table of Contents

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2021

Accepted: 01 December 2020

Revised: 01 March 2020

Received: 01 March 2019

Published in TKDD Volume 15, Issue 5

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF (National Science Foundation)

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
371
Total Downloads

Downloads (Last 12 months)139
Downloads (Last 6 weeks)24

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yu SXia FChen HLee IChi LTong H(2024)Heterogeneous Network Motif Coding, Counting, and ProfilingACM Transactions on Knowledge Discovery from Data10.1145/368746518:9(1-21)Online publication date: 30-Oct-2024
https://dl.acm.org/doi/10.1145/3687465
Jiang ZChen HJin H(2023)Auxo: A Scalable and Efficient Graph Stream Summarization StructureProceedings of the VLDB Endowment10.14778/3583140.358315416:6(1386-1398)Online publication date: 1-Feb-2023
https://dl.acm.org/doi/10.14778/3583140.3583154

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents