Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage

Published: 01 October 2017 Publication History

Abstract

Counting triangles in a large graph is important for detecting network anomalies such as spam web pages and suspicious accounts (e.g., fraudsters and advertisers) on online social networks. However, it is challenging to compute the number of triangles in a large graph represented as a stream of edges with a low computational cost when given a limited memory. Recently, several effective sampling-based approximation methods have been developed to solve this problem. However, they assume the graph stream of interest contains no duplicate edges, which does not hold in many real-world graph streams (e.g., phone calling networks). In this paper, we observe that these methods exhibit a large estimation error or computational cost even when modified to deal with duplicate edges using deduplication techniques such as Bloom filter and hash-based sampling. To solve this challenge, we design a one-pass streaming algorithm for uniformly sampling distinct edges at a high speed. Compared to state-of-the-art algorithms, our algorithm reduces the sampling cost per edge from O(log k) (k is the maximum number of sampled edges determined by the available memory space) to O(1) without using any additional memory space. Based on sampled edges, we develop a simple yet accurate method to infer the number of triangles in the original graph stream. We conduct extensive experiments on a variety of real-world large graphs, and the results demonstrate that our method is several times more accurate and faster than state-of-the-art methods with the same memory usage.

References

[1]
N. Ahmed, N. Duffield, J. Neville, and R. Kompella. Graph sample and hold: A framework for big-graph analytics. In SIGKDD, pages 1446--1455, 2014.
[2]
N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17:354--364, 1997.
[3]
S. Arifuzzaman, M. Khan, and M. Marathe. Patric: A parallel algorithm for counting triangles in massive networks. In CIKM, pages 529--538, 2013.
[4]
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10, 2002.
[5]
Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, pages 623--632, 2002.
[6]
L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient algorithms for large-scale local triangle counting. TKDD, 4(3):13, 2010.
[7]
J. W. Berry, B. Hendrickson, R. A. LaViolette, and C. A. Phillips. Tolerating the community detection resolution limit with edge weighting. Physical Review E, 83(5):056119+, 2011.
[8]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. CACM, 13(7):422--426, 1970.
[9]
L. S. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler. Counting triangles in data streams. In PODS, pages 253--262, 2006.
[10]
H. Chun, Y. yeol Ahn, H. Kwak, S. Moon, Y. ho Eom, and H. Jeong. Comparison of online social relations in terms of volume vs. interaction: A case study of cyworld. In IMC, pages 57--70, 2008.
[11]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997.
[12]
E. Cohen. All-distances sketches, revisited: Hip estimators for massive graphs analysis. In PODS, pages 2320--2334, 2014.
[13]
T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
[14]
J.-P. Eckmann and E. Moses. Curvature of co-links uncovers hidden thematic layers in the world wide web. PNAS, 99(9):5825--5829, 2002.
[15]
P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA, pages 127--146, 2007.
[16]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985.
[17]
I. Giechaskiel, G. Panagopoulos, and E. Yoneki. PDTL: parallel and distributed triangle listing for massive graphs. In ICPP, pages 370--379, 2015.
[18]
X. Hu, Y. Tao, and C.-W. Chung. Massive graph triangulation. In SIGMOD, pages 325--336, 2013.
[19]
M. Jha, A. Pinar, and C. Seshadhri. Counting triangles in real-world graph streams: Dealing with repeated edges and time windows. In ACSSC, pages 1507--1514, 2015.
[20]
M. Jha, C. Seshadhri, and A. Pinar. A space efficient streaming algorithm for triangle counting using the birthday paradox. In SIGKDD, pages 589--597, 2013.
[21]
H. Jowhari and M. Ghodsi. New streaming algorithms for counting triangles in graphs. In COCOON, pages 710--716, 2005.
[22]
M. Jung, S. Lee, Y. Lim, and U. Kang. FURL: fixed-memory and uncertainty reducing local triangle counting for graph streams. CoRR, abs/1611.06615, 2016.
[23]
U. Kang, B. Meeder, E. E. Papalexakis, and C. Faloutsos. Heigen: Spectral analysis for billion-scale graphs. TKDE, 26(2):350--362, 2014.
[24]
J. Kim, W.-S. Han, S. Lee, K. Park, and H. Yu. Opt: A new framework for overlapped and parallel triangulation in large-scale graphs. In SIGMOD, pages 637--648, 2014.
[25]
J. Kunegis. Handbook of network analysis {KONECT - the koblenz network collection}. CoRR, abs/1402.5500:1343--1350, 2014.
[26]
K. Kutzkov and R. Pagh. On the streaming complexity of computing local clustering coefficients. In WSDM, pages 677--686, 2013.
[27]
H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591--600, 2010.
[28]
M. Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. TCS, 407(1--3):458--473, 2008.
[29]
J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, pages 641--650, 2010.
[30]
Y. Lim and U. Kang. MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In SIGKDD, pages 685--694, 2015.
[31]
R. Milo, E. Al, and C. Biology. Network motifs: Simple building blocks of complex networks. Science, 298(5549):824--827, 2002.
[32]
A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC, pages 29--42, 2007.
[33]
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.
[34]
R. Pagh and F. Silvestri. The input/output complexity of triangle enumeration. In PODS, pages 224--233, 2014.
[35]
H. Park, S. Myaeng, and U. Kang. PTE: enumerating trillion triangles on distributed systems. In SIGKDD, pages 1115--1124, 2016.
[36]
H.-M. Park and C.-W. Chung. An efficient mapreduce algorithm for counting triangles in a very large graph. In CIKM, pages 539--548, 2013.
[37]
H.-M. Park, F. Silvestri, U. Kang, and R. Pagh. Mapreduce triangle enumeration with guarantees. In CIKM, pages 1739--1748, 2014.
[38]
A. Pavany, K. Tangwongsan, S. Tirthapuraz, and K.-L. Wu. Counting and sampling triangles from a graph stream. In PVLDB, 6(14):1870--1881, 2013.
[39]
T. Schank. Algorithmic aspects of triangle-based network analysis. Phd in Computer Science, 2007.
[40]
T. Schank and D. Wagner. Finding, counting and listing all triangles in large graphs, an experimental study. In WEA, pages 606--609, 2005.
[41]
D. Schiöberg, F. Schneider, S. Schmid, S. Uhlig, and A. Feldmann. Evolution of directed triangle motifs in the google+ OSN. CoRR, abs/1502.04321, 2015.
[42]
C. Seshadhri, A. Pinar, N. Durak, and T. G. Kolda. Directed closure measures for networks with reciprocity. J. Complex Networks, 5(1):32--47, 2017.
[43]
C. Seshadhri, A. Pinar, and T. G. Kolda. Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Statistical Analysis and Data Mining, 7(4):294--307, 2014.
[44]
L. D. Stefani, A. Epasto, M. Riondato, and E. Upfal. Trièst: Counting local and global triangles in fully-dynamic streams with fixed memory size. In SIGKDD, pages 825--834, 2016.
[45]
L. D. Stefani, A. Epasto, M. Riondato, and E. Upfal. Trièst: Counting local and global triangles in fully-dynamic streams with fixed memory size. CoRR, abs/1602.07424, 2016.
[46]
S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011.
[47]
D. Ting. Streamed approximate counting of distinct elements: Beating optimal batch methods. In SIGKDD, pages 442--451, 2014.
[48]
C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: Counting triangles in massive graphs with a coin. In KDD, pages 837--846, 2009.
[49]
J. S. Vitter. Random sampling with a reservoir. TOMS, 11(1):37--57, 1985.
[50]
H. T. Welser, E. Gleave, D. Fisher, and M. Smith. Visualizing the signatures of social roles in online discussion groups. JoSS, 8(2):1--32, 2007.
[51]
B. Wu, K. Yi, and Z. Li. Counting triangles in large graphs by random sampling. TKDE, 28(8):2013--2026, 2016.
[52]
Z. Yang, C. Wilson, X. Wang, T. Gao, B. Y. Zhao, and Y. Dai. Uncovering social network sybils in the wild. TKDD, 8(1):1--29, 2014.
[53]
H. Zhang, Y. Zhu, L. Qin, H. Cheng, and J. X. Yu. Efficient triangle listing for billion-scale graphs. In BigData, pages 813--822, 2016.

Cited By

View all
  • (2024)LM-SRPQ: Efficiently Answering Regular Path Query in Streaming GraphsProceedings of the VLDB Endowment10.14778/3641204.364121417:5(1047-1059)Online publication date: 1-Jan-2024
  • (2024)Sting: Near-storage accelerator framework for scalable triangle counting and beyondProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3658265(1-6)Online publication date: 23-Jun-2024
  • (2024)FABLE: Approximate Butterfly Counting in Bipartite Graph Stream with Duplicate EdgesProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679812(2158-2167)Online publication date: 21-Oct-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 11, Issue 2
October 2017
176 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 October 2017
Published in PVLDB Volume 11, Issue 2

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)27
  • Downloads (Last 6 weeks)2
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)LM-SRPQ: Efficiently Answering Regular Path Query in Streaming GraphsProceedings of the VLDB Endowment10.14778/3641204.364121417:5(1047-1059)Online publication date: 1-Jan-2024
  • (2024)Sting: Near-storage accelerator framework for scalable triangle counting and beyondProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3658265(1-6)Online publication date: 23-Jun-2024
  • (2024)FABLE: Approximate Butterfly Counting in Bipartite Graph Stream with Duplicate EdgesProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679812(2158-2167)Online publication date: 21-Oct-2024
  • (2023)Triangular Stability Maximization by Influence Spread over Social NetworksProceedings of the VLDB Endowment10.14778/3611479.361149016:11(2818-2831)Online publication date: 24-Aug-2023
  • (2023)Sliding window-based approximate triangle counting with bounded memory usageThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00783-332:5(1087-1110)Online publication date: 9-Mar-2023
  • (2022)sGrapp: Butterfly Approximation in Streaming GraphsACM Transactions on Knowledge Discovery from Data10.1145/349501116:4(1-43)Online publication date: 8-Jan-2022
  • (2022)Top-k heavy weight triangles listing on graph streamWorld Wide Web10.1007/s11280-022-01117-z26:4(1827-1851)Online publication date: 17-Nov-2022
  • (2022)Four node graphlet and triad enumeration on distributed platformsDistributed and Parallel Databases10.1007/s10619-022-07416-840:2-3(335-372)Online publication date: 1-Sep-2022
  • (2022)Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costsKnowledge and Information Systems10.1007/s10115-022-01691-864:7(1909-1935)Online publication date: 1-Jul-2022
  • (2021)Sliding Window-based Approximate Triangle Counting over Streaming Graphs with Duplicate EdgesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452800(645-657)Online publication date: 9-Jun-2021
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media