Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

FURL: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Given a multigraph stream (e.g., Facebook messages or network traffic) where duplicate edges arrive continuously, how can we accurately and memory-efficiently estimate local triangles for all nodes? Local triangle counting in a graph stream is one of the fundamental tasks in graph mining with important applications including anomaly detection, social role identification, community detection, etc. Many recent graph streams include duplicate edges, hence form multigraph streams: e.g., many network packets might have a same (source, destination) pair. Although there have been several local triangle counting methods for multigraph streams, they have problems in terms of accuracy and memory efficiency; furthermore, most methods support either binary or weighted counting, and thus cannot find anomalies whose detection requires both types of counting. In this paper, we propose FURL, a memory-efficient and accurate local triangle counting method for multigraph streams. FURL has two main advantages. First, FURL improves accuracy by (1) reducing the variance of its estimation via a regularization strategy, and (2) sampling more triangles than the state-of-the-art methods do, by using its memory space efficiently. Second, FURL finds anomalies which state-of-the-art methods cannot discover. Experimental results show that FURL outperforms state-of-the-art methods in terms of accuracy and memory efficiency. Thanks to FURL, we discover interesting anomalies from a Bitcoin network.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  • Alon N, Yuster R, Zwick U (1997) Finding and counting given length cycles. Algorithmica 17(3):209–223. https://doi.org/10.1007/BF02523189

    Article  MathSciNet  MATH  Google Scholar 

  • Becchetti L, Boldi P, Castillo C, Gionis A (2010) Efficient algorithms for large-scale local triangle counting. ACM Trans Knowl Discov Data 4(3):13:1–13:28. https://doi.org/10.1145/1839490.1839494

    Article  Google Scholar 

  • Berry JW, Hendrickson B, LaViolette RA, Phillips CA (2011) Tolerating the community detection resolution limit with edge weighting. Phys Rev E 83(5):056119. https://doi.org/10.1103/PhysRevE.83.056119

    Article  Google Scholar 

  • Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (2000) Min-wise independent permutations. J Comput Syst Sci 60(3):630–659. https://doi.org/10.1006/jcss.1999.1690

    Article  MathSciNet  MATH  Google Scholar 

  • Chou B, Suzuki E (2010) Discovering community-oriented roles of nodes in a social network. In: 12th international conference data warehousing and knowledge discovery (DAWAK), pp 52–64

  • Eckmann JP, Moses E (2002) Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Natl Acad Sci 99(9):5825–5829. https://doi.org/10.1073/pnas.032093399

    Article  MathSciNet  Google Scholar 

  • Epasto A, Lattanzi S, Mirrokni VS, Sebe I, Taei A, Verma S (2015) Ego-net community mining applied to friend suggestion. Proc VLDB Endow 9(4):324–335. https://doi.org/10.14778/2856318.2856327

    Article  Google Scholar 

  • Feller W (1968) An introduction to probability theory and its applications, vol 1. Wiley, London

    MATH  Google Scholar 

  • Flajolet P, Martin GN (1985) Probabilistic counting algorithms for data base applications. J Comput Syst Sci 31(2):182–209. https://doi.org/10.1016/0022-0000(85)90041-8

    Article  MathSciNet  MATH  Google Scholar 

  • Gentle JE (2009) Computational statistics, 1st edn. Springer, New York

    Book  MATH  Google Scholar 

  • Jha M, Pinar A, Seshadhri C (2015) Counting triangles in real-world graph streams: dealing with repeated edges and time windows. In: 49th Asilomar conference on signals, systems, and computers (ACSSC), pp 1507–1514

  • Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: Sixth ACM international conference on web search and data mining, (WSDM), pp 677–686

  • Latapy M (2008) Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor Comput Sci 407(1–3):458–473. https://doi.org/10.1016/j.tcs.2008.07.017

    Article  MathSciNet  MATH  Google Scholar 

  • Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 685–694

  • Lim Y, Jung M, Kang U (2018) Memory-efficient and accurate sampling for counting local triangles in graph streams: from simple to multigraphs. ACM Trans Knowl Discov Data 12(1):4:1–4:28. https://doi.org/10.1145/3022186

    Article  Google Scholar 

  • Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827

    Article  Google Scholar 

  • Pagh R, Tsourakakis CE (2012) Colorful triangle counting and a mapreduce implementation. Inf Process Lett 112(7):277–281. https://doi.org/10.1016/j.ipl.2011.12.007

    Article  MathSciNet  MATH  Google Scholar 

  • Stefani LD, Epasto A, Riondato M, Upfal E (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 825–834

  • Stefani LD, Epasto A, Riondato M, Upfal E (2017) Trièst: counting local and global triangles in fully dynamic streams with fixed memory size. ACM Trans Knowl Discov Data 11(4):43:1–43:50. https://doi.org/10.1145/3059194

    Article  Google Scholar 

  • Sunter A (1977) List sequential sampling with equal or unequal probabilities without replacement. Appl Stat 26(3):261–268. https://doi.org/10.2307/2346966

    Article  MathSciNet  Google Scholar 

  • Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th international conference on world wide web (WWW), pp 607–614

  • Tsourakakis CE, Kang U, Miller GL, Faloutsos C (2009) DOULION: counting triangles in massive graphs with a coin. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD) vol. 1, 2009, pp 837–846

  • Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57. https://doi.org/10.1145/3147.3165

    Article  MathSciNet  MATH  Google Scholar 

  • Wang P, Qi Y, Sun Y, Zhang X, Tao J, Guan X (2017) Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage. Proc VLDB Endow 11(2):162–175. https://doi.org/10.14778/3149193.3149197

    Article  Google Scholar 

  • Welser HT, Gleave E, Fisher D, Smith MA (2007) Visualizing the signatures of social roles in online discussion groups. J Soc Struct 8(2):1–32

    Google Scholar 

  • Yang Z, Wilson C, Wang X, Gao T, Zhao BY, Dai Y (2011) Uncovering social network sybils in the wild. In: Proceedings of the 11th ACM SIGCOMM internet measurement conference, (IMC), pp 259–268

Download references

Acknowledgements

This work was supported by AFOSR/AOARD under the Grant No. FA2386-16-1-4044. This work was also supported by Institute for Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development). The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provides research facilities for this study.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to U. Kang.

Additional information

Responsible editor: Derek Greene, Bjørn Bringmann, Elisa Fromont, Jesse Davis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Proofs of Lemmas and Theorems

Appendix A Proofs of Lemmas and Theorems

Theorem 1

Let \(\varDelta _u\) be the true local triangle count for a node u, and \(c_u\) be the estimation given by \({{\textsc {FURL}}\hbox {-}{\textsc {0}}_{\textsc {B}}}\). For every node u, \( \mathbb {E}\left[ c_u\right] = \varDelta _u. \)

Proof

Let \(T_\lambda \) be the time \(\lambda \) is formed and \(T_M\) be the last time we have the exact triangle counts. Let \(\varLambda _u^{-}\) be the set of triangles containing node u that are formed when we have the exact triangle counts, i.e. \(T_\lambda \le T_M\) , and \(\varLambda _u^{+}\) be the set of triangles containing node u that are formed when we do not have the exact triangle counts, i.e. \(T_\lambda > T_M\). Note that every triangle containing node u is included in either \(\varLambda _u^{-}\) or \(\varLambda _u^{+}\).

We define random variables \(X_\lambda ^{-}\) and \(X_\lambda ^{+}\) as follows:

$$\begin{aligned} X_{\lambda }^{-} = {\left\{ \begin{array}{ll} 1 &{} \lambda \ is \ {counted}\\ 0 &{} otherwise. \end{array}\right. } \ X_{\lambda }^{+} = {\left\{ \begin{array}{ll} q_{T_\lambda }= \frac{M-3}{M} \cdot \frac{1}{h_{max}^3} &{} \lambda \ is \ {counted}\\ 0 &{} otherwise. \end{array}\right. } \end{aligned}$$

For a triangle \(\lambda \in \varLambda _u^{-}\), \(\mathbb {E}\left[ X_{\lambda }^{-}\right] =1 \times \Pr \left[ \lambda \ is \ counted \right] = 1\) since all edges are unconditionally sampled at \(T\le T_M\).

Now we show \(\mathbb {E}\left[ X_{\lambda }^{+}\right] =1\) for each triangle \(\lambda \in \varLambda _u^{+}\). Let \(u(T) \ge M\) be the number of unique edges that have arrived so far in the stream at time T. Considering repeated edges as one edge, we get a refined stream that can be viewed as a sequence of independent random variables \(h_i(1 \le i \le u(T_\lambda ))\) that has a uniform distribution in range (0, 1). \(h_{max}\) can be viewed as the Mth smallest value among \(u(T_\lambda )\) number of random variables since D keeps the M minimum values. Then \(h_{max} \sim Beta(M,u(T_\lambda )+1-M)\) by the rule of kth order statistic in uniform distribution (Gentle 2009). Thus its probability density function \(f(x)=\frac{\gamma (u(T_\lambda )+1)}{\gamma (M) \cdot \gamma (u(T_\lambda )+1-M)} \cdot x^{M-1} \cdot (1-x)^{u(T_\lambda )-M}\). \(\Pr \left[ \lambda ~is ~counted \right] = \frac{M(M-1)(M-2)}{u(T_\lambda )(u(T_\lambda )-1)(u(T_\lambda )-2)}\) since it is the probability that 3 edges of \(\lambda \) are stored in the buffer D simultaneously. Note that \(\Pr \left[ \lambda ~is ~counted \right] \) is independent of the event \(h_{max}=x\).

$$\begin{aligned} \mathbb {E}\left[ X_{\lambda }^{+}\right]&= \mathbb {E}\left[ \mathbb {E}\left[ X_{\lambda }^{+}|h_{max}\right] \right] \\&= \int _{0}^{1} \Pr \left[ \lambda {\textit{is counted}}|h_{max}=x \right] \cdot q_{T_\lambda } \cdot f(x) dx\\&= \frac{M(M-1)(M-2)}{u(T_\lambda )(u(T_\lambda )-1)(u(T_\lambda )-2)} \cdot \int _{0}^{1} \frac{M-3}{M} \cdot \frac{1}{x^3} \cdot f(x) dx = 1 \end{aligned}$$

Therefore, \( \mathbb {E}\left[ c_u\right] = \sum \limits _{\lambda \in \varLambda _u^{-}} \mathbb {E}\left[ X_{\lambda }^{-}\right] + \sum \limits _{\lambda \in \varLambda _u^{+}} \mathbb {E}\left[ X_{\lambda }^{+}\right] = \varDelta _u. \)\(\square \)

Theorem 2

Let \(\varDelta _u\) be the true local triangle count for a node u, and \(c_u\) be the estimation given by \({{\textsc {FURL}}\hbox {-}{\textsc {0}}_{\textsc {W}}}\). For every node u, \( \mathbb {E}\left[ c_u\right] = \varDelta _u. \)

Proof

We define random variables \(X_\lambda ^{-}\) and \(X_\lambda ^{+}\) which have 1 and \(q_{T_\lambda }=\frac{M-2}{M} \cdot \frac{1}{h_{max}^2}\) respectively if \(\lambda \) is counted, or 0 otherwise.

$$\begin{aligned} X_{\lambda }^{-} = {\left\{ \begin{array}{ll} 1 &{} \lambda \ is \ {counted}\\ 0 &{} otherwise. \end{array}\right. } \ X_{\lambda }^{+} = {\left\{ \begin{array}{ll} q_{T_\lambda }= \frac{M-2}{M} \cdot \frac{1}{h_{max}^2} &{} \lambda \ is \ {counted}\\ 0 &{} otherwise. \end{array}\right. } \end{aligned}$$

Then, \( \mathbb {E}\left[ c_u\right] = \sum \limits _{\lambda \in \varLambda _u^{-}} \mathbb {E}\left[ X_{\lambda }^{-}\right] + \sum \limits _{\lambda \in \varLambda _u^{+}} \mathbb {E}\left[ X_{\lambda }^{+}\right] \).

In \({{\textsc {FURL}}\hbox {-}{\textsc {0}}_{\textsc {W}}}\), although \(\lambda \) updates the estimation, the buffer does not meet a new edge at \(T_\lambda \) yet since a triangle is updated before sampling the edge. Then \(h_{max}\) can be viewed as the Mth smallest value among \(u(T_\lambda -1)\) number of random variables in range (0, 1) and \(h_{max} \sim Beta(M,u(T_\lambda -1)+1-M)\). Thus its probability density function is given as \(g(x)=\frac{\gamma (u(T_\lambda -1)+1)}{\gamma (M) \cdot \gamma (u(T_\lambda -1)+1-M)} \cdot x^{M-1} \cdot (1-x)^{u(T_\lambda -1)-M}\).

The probability that a triangle \(\lambda \) is counted is \(\frac{M(M-1)}{u(T_\lambda -1)(u(T_\lambda -1)-1)}\) because \(\lambda \) is counted if an arriving edge e forms \(\lambda \) with the other two edges in the buffer regardless of sampling e.

$$\begin{aligned} \mathbb {E}\left[ X_{\lambda }^{+}\right]&= \Pr \left[ \lambda ~is ~counted \right] \cdot \int _{0}^{1} q_{(T_\lambda -1)} \cdot g(x) dx \\&= \frac{M(M-1)}{u(T_\lambda -1)(u(T_\lambda -1)-1)} \cdot \int _{0}^{1} \frac{M-2}{M} \cdot \frac{1}{x^2} \cdot g(x) dx\\&=\int _{0}^{1} \frac{\gamma (u(T_\lambda -1)-1)}{\gamma (M-2) \gamma (u(T_\lambda -1)+1-M)} \cdot x^{M-3} (1-x)^{u(T_\lambda -1)-M}dx = 1 \end{aligned}$$

Therefore, \( \mathbb {E}\left[ c_u\right] = \sum \limits _{\lambda \in \varLambda _u^{-}} \mathbb {E}\left[ X_{\lambda }^{-}\right] + \sum \limits _{\lambda \in \varLambda _u^{+}} \mathbb {E}\left[ X_{\lambda }^{+}\right] = \varDelta _u. \)\(\square \)

Lemma 1

Let \(b_\lambda > 0\) be the bucket where a triangle \(\lambda \) is formed and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) by \({{\textsc {FURL}}_{\textsc {B}}}\). Then,

$$\begin{aligned} \mathbb {E}\left[ Y_\lambda \right] = 1 - \phi (b_\lambda ), \end{aligned}$$

where \(\phi (i) = \delta ^{b-i+1}\) and b is the current bucket.

Proof

Let \(W(i)=(1-\delta )\delta ^{b-i}\) for \(i\ge 1\) and \(q_{T_\lambda } = \frac{(T-1)(T-2)}{M(M-1)}\). By definition, for \(\lambda \) appearing in bucket \(b_\lambda \), \(Y_\lambda \) becomes \(q_{T_\lambda } W(b_\lambda ) + q_{T_\lambda } W(b_\lambda +1) + \cdots + q_{T_\lambda } W(b)\) if \(\lambda \) is counted; if \(\lambda \) is not counted, \(Y_\lambda =0\). Thus, we obtain

$$\begin{aligned} \mathbb {E}\left[ Y_\lambda \right] = \Pr \left[ \lambda \ is \ counted \right] \cdot \left( q_{T_\lambda }\sum _{b_\lambda \le j\le b} W(j)\right) = 1 - \phi (b_\lambda ). \end{aligned}$$

\(\square \)

Lemma 2

Let \(b_\lambda > 0\) be the bucket where a triangle \(\lambda \) is formed and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) by \({{\textsc {FURL}}_{\textsc {B}}}\). Then,

$$\begin{aligned} \mathrm {Var}\left[ Y_\lambda \right] = \left( 1-\phi (b_\lambda )\right) ^2 \left( k_{T_\lambda } - 1 \right) , \end{aligned}$$

where \(T_\lambda \) is the first time all three edges of \(\lambda \) arrive, \(k_{T_\lambda } = \frac{(M-3)(u(T_\lambda )-3)(u(T_\lambda )-4)(u(T_\lambda )-5)}{M(M-4)(M-5)(M-6)}\), \(\phi (i) = \delta ^{b-i+1}\), and b is the current bucket.

Proof

Following the definition \(\mathrm {Var}\left[ Y_\lambda \right] = \mathbb {E}\left[ Y_\lambda ^2\right] - \mathbb {E}\left[ Y_\lambda \right] ^2\) with Lemma 1, the proof is done. \(\square \)

Theorem 3

Let \(Y_\lambda \) and \(X_\lambda \) be the estimated counts of a triangle \(\lambda \) by \({{\textsc {FURL}}_{\textsc {B}}}\) and \({{\textsc {FURL}}\hbox {-}{\textsc {0}}_{\textsc {B}}}\), respectively. Consider any triangle \(\lambda \) that is counted at time \(T_\lambda > T_M\). Let u(T) be the number of unique edges that have arrived at time T. If \(u(T_\lambda ) \ge \root 3 \of {\frac{1+\alpha }{\alpha }}M + 3\), the interval by \(\mathbb {E}\left[ Y_\lambda \right] \pm \alpha \cdot \mathrm {Var}\left[ Y_\lambda \right] \) is strictly included in that by \(\mathbb {E}\left[ X_\lambda \right] \pm \alpha \mathrm {Var}\left[ X_\lambda \right] \) for any \(\alpha \).

Proof

We first show that \(\mathbb {E}\left[ Y_u\right] - \alpha \cdot \mathrm {Var}\left[ Y_u\right] > \mathbb {E}\left[ X_u\right] - \alpha \cdot \mathrm {Var}\left[ X_u\right] \).

Let \(\psi _\lambda =1-\delta ^{b-b_\lambda +1}\); we will find the condition satisfying \( \psi _\lambda - \alpha \psi _\lambda ^2\left( k_{T_\lambda }-1\right) - 1 + \alpha (k_{T_\lambda }-1) > 0, \) which is developed as follows.

$$\begin{aligned} \psi _\lambda&> \frac{1-\alpha k_{T_\lambda }+\alpha }{\alpha k_{T_\lambda }-\alpha }. \end{aligned}$$
(1)

Below, we will show the condition under which Eq. (1) holds. Let \(u(T_\lambda ) - 3 = \beta M\). By definition,

$$\begin{aligned} k_{T_\lambda }&> \frac{(M-3)(u(T_\lambda )-3)^3}{M(M-4)^3} > \frac{(u(T_\lambda )-3)^3}{M^3} = \beta ^3. \end{aligned}$$

Then,

$$\begin{aligned} \frac{1-\alpha k_{T_\lambda }+\alpha }{\alpha k_{T_\lambda }-\alpha }&< \frac{1-\alpha \beta ^3+\alpha }{\alpha \beta ^3-\alpha } = \frac{1}{\alpha \beta ^3-\alpha } - 1. \end{aligned}$$
(2)

Now we examine the left term of Eq. (1). Since \(1\le b_\lambda \le b\), the lower bound of \(\psi _\lambda \) becomes \(\psi _\lambda \ge 1 - \delta \). Then, we obtain a sufficient condition for (1) as follows:

$$\begin{aligned} \beta ^3&\ge \frac{1+2\alpha -\alpha \delta }{2\alpha -\alpha \delta } \end{aligned}$$
(3)

For \(\beta \ge \root 3 \of {\frac{1+\alpha }{\alpha }}\), Eq. (3) always holds since the upper bound of the right term becomes \(\frac{1+\alpha }{\alpha }\). Note that \(0\le \delta <1\). Thus, we finally obtain the condition under which Eq. (1) holds as follows:

$$\begin{aligned} u(T_\lambda ) \ge \root 3 \of {\frac{1+\alpha }{\alpha }}M + 3. \end{aligned}$$

Due to the underestimation of \({{\textsc {FURL}}_{\textsc {B}}}\), \(\mathbb {E}\left[ Y_\lambda \right] + \alpha \cdot \mathrm {Var}\left[ Y_\lambda \right] < \mathbb {E}\left[ X_\lambda \right] + \alpha \cdot \mathrm {Var}\left[ X_\lambda \right] \) always holds, which completes the proof. \(\square \)

Lemma 3

Let \(b_\lambda \) be the bucket where \(\lambda \) is formed and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) with \(b_\lambda >0\) by \({{\textsc {FURL}}_{\textsc {W}}}\). For every triangle \(\lambda \),

$$\begin{aligned} \mathbb {E}\left[ Y_\lambda \right] = 1 - \phi (b_\lambda ), \end{aligned}$$

where \(\phi (i) = \delta ^{b-i+1}\) and b is the current bucket.

Proof

The lemma is proved in the same way as in Lemma 1. \(\square \)

Lemma 4

Let \(b_\lambda \) be the bucket where \(\lambda \) is formed and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) with \(b_\lambda >0\) by \({{\textsc {FURL}}_{\textsc {W}}}\). For every triangle \(\lambda \),

$$\begin{aligned} \mathrm {Var}\left[ Y_\lambda \right] = \left( 1-\phi (b_\lambda )\right) ^2 \left( l_{T_\lambda } - 1 \right) , \end{aligned}$$

where \(T_\lambda \) is the first time all three edges of \(\lambda \) arrive, \(l_{T_\lambda } = \frac{(M-2)(u(T_\lambda -1)-2)(u(T_\lambda -1)-3)}{M(M-3)(M-4)}\), \(\phi (i) = \delta ^{b-i+1}\), and b is the current bucket.

Proof

The lemma is proved in the same way as in Lemma 2. \(\square \)

Theorem 4

Let \(Y_\lambda \) and \(X_\lambda \) be the estimated counts of a triangle \(\lambda \) by \({{\textsc {FURL}}_{\textsc {W}}}\) and \({{\textsc {FURL}}\hbox {-}{\textsc {0}}_{\textsc {W}}}\), respectively. Consider any triangle \(\lambda \) that is counted at time \(T_\lambda > T_M\). If \(u(T_\lambda -1) \ge \sqrt{\frac{1+\alpha }{\alpha }}M + 2\), the interval by \(\mathbb {E}\left[ Y_\lambda \right] \pm \alpha \cdot \mathrm {Var}\left[ Y_\lambda \right] \) is strictly included in that by \(\mathbb {E}\left[ X_\lambda \right] \pm \alpha \cdot \mathrm {Var}\left[ X_\lambda \right] \) for any \(\alpha \).

Proof

We first show that \(\mathbb {E}\left[ Y_\lambda \right] - \alpha \cdot \mathrm {Var}\left[ Y_\lambda \right] > \mathbb {E}\left[ X_\lambda \right] - \alpha \cdot \mathrm {Var}\left[ X_\lambda \right] \).

Let \(\psi _\lambda =1-\delta ^{B-b_\lambda +1}\); we will find the condition satisfying:

$$\begin{aligned} \psi _\lambda&> \frac{1-\alpha l_{T_\lambda }+\alpha }{\alpha l_{T_\lambda }-\alpha }. \end{aligned}$$
(4)

Let \(u(T_\lambda -1) - 2 = \beta M\). Then,

$$\begin{aligned} \frac{1-\alpha l_{T_\lambda }+\alpha }{\alpha l_{T_\lambda }-\alpha } < \frac{1}{\alpha \beta ^2-\alpha } - 1. \ \left( \because \ l_{T_\lambda } > \frac{(u(T_\lambda -1)-2)^2}{M^2} = \beta ^2.\right) \end{aligned}$$

Since \(1\le b_\lambda \le B\), the lower bound of \(\psi _\lambda \) becomes \(\psi _\lambda \ge 1 - \delta \). Then, we obtain a sufficient condition for (4) as follows:

$$\begin{aligned} \beta ^2&\ge \frac{1+2\alpha -\alpha \delta }{2\alpha -\alpha \delta } \end{aligned}$$
(5)

For \(\beta \ge \sqrt{\frac{1+\alpha }{\alpha }}\), Eq. (5) always holds since the upper bound of the right term becomes \(\frac{1+\alpha }{\alpha }\). Note that \(0\le \delta <1\). Thus, we finally obtain the condition under which Eq. (4) holds as follows:

$$\begin{aligned} u(T_\lambda -1) \ge \sqrt{\frac{1+\alpha }{\alpha }}M + 2. \end{aligned}$$

Due to the underestimation of \({{\textsc {FURL}}_{\textsc {W}}}\), \(\mathbb {E}\left[ Y_\lambda \right] + \alpha \cdot \mathrm {Var}\left[ Y_\lambda \right] < \mathbb {E}\left[ X_\lambda \right] + \alpha \cdot \mathrm {Var}\left[ X_\lambda \right] \) always holds, which completes the proof. \(\square \)

Lemma 5

Let \(\varDelta _u\) be the true local triangle count for a node u and \(Y_u\) be the estimation given by FURL for node u. Then,

$$\begin{aligned} \left( 1 - \delta \right) \varDelta _u < \mathbb {E}\left[ Y_u\right] \le \varDelta _u. \end{aligned}$$

Proof

Let \(X_\lambda \) and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) by FURL-0 and FURL respectively, \(T_\lambda \) be the time \(\lambda \) is formed, and \(T_M\) be the last time we have the exact triangle counts. Let \(\varLambda _u^{-}\) be the set of triangles containing node u that are formed when we have the exact triangle counts, i.e. \(T_\lambda \le T_M\), and \(\varLambda _u^{+}\) be the set of triangles containing node u that are formed when we do not have the exact triangle counts, i.e. \(T_\lambda > T_M\). Note that every triangle containing node u is included in either \(\varLambda _u^{-}\) or \(\varLambda _u^{+}\).

For triangle \(\lambda ^{-} \in \varLambda _u^{-}\),

$$\begin{aligned} Y_{\lambda ^{-}} = X_{\lambda ^{-}} \end{aligned}$$

and for triangle \(\lambda ^{+} \in \varLambda _u^{+}\),

$$\begin{aligned} Y_{\lambda ^{+}} = W(b_{\lambda ^{+}})X_{\lambda ^{+}} + W(b_{\lambda ^{+}}+1)X_{\lambda ^{+}} + \cdots + W(b)X_{\lambda ^{+}} = \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) X_{\lambda ^{+}} \end{aligned}$$

where \(b_\lambda \) is the bucket where a triangle \(\lambda \) is formed and \(W(i)=(1-\delta )\delta ^{b-i}\). Then,

$$\begin{aligned} \mathbb {E}\left[ Y_u\right]&= \sum _{\lambda ^{-} \in \varLambda _u^{-}} \mathbb {E}\left[ X_{\lambda ^{-}}\right] + \sum _{\lambda ^{+} \in \varLambda _u^{+}} \mathbb {E}\left[ \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) X_{\lambda ^{+}}\right] \\&= \sum _{\lambda ^{-} \in \varLambda _u^{-}} 1 + \sum _{\lambda ^{+} \in \varLambda _u^{+}} \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) . \end{aligned}$$

\(1 \le b_{\lambda ^{+}} \le b\) and \(0< \delta < 1\). Thus,

$$\begin{aligned} \left( 1 - \delta \right) \varDelta _u < \mathbb {E}\left[ Y_u\right] \le \varDelta _u. \end{aligned}$$

\(\square \)

Lemma 6

Let \(X_u\) and \(Y_u\) be the estimated local triangle count for a node u given by FURL-0 and FURL respectively, \(\varLambda _u\) be the set of triangles containing node u.

$$\begin{aligned} \left( 1 - \delta \right) ^2\mathrm {Var}\left[ X_u\right] \le \mathrm {Var}\left[ Y_u\right] \le \left( 1 - \delta ^b\right) ^2\mathrm {Var}\left[ X_u\right] \end{aligned}$$

where b is the current bucket.

Proof

Let \(X_\lambda \) and \(Y_\lambda \) be the estimated count of a triangle \(\lambda \) by FURL-0 and FURL respectively. Let \(\varLambda _u^{-}\) be the set of triangles containing node u that are formed when we have the exact triangle counts and \(\varLambda _u^{+}\) be the set of triangles containing node u that are formed when we do not have the exact triangle counts.

For triangle \(\lambda ^{-} \in \varLambda _u^{-}\),

$$\begin{aligned} Y_{\lambda ^{-}} - \mathbb {E}\left[ Y_{\lambda ^{-}}\right] = X_{\lambda ^{-}} - \mathbb {E}\left[ X_{\lambda ^{-}}\right] = 1 - 1 = 0 \end{aligned}$$

and for triangle \(\lambda ^{+} \in \varLambda _u^{+}\),

$$\begin{aligned} Y_{\lambda ^{+}} - \mathbb {E}\left[ Y_{\lambda ^{+}}\right]= & {} \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) X_{\lambda ^{+}} - \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) \mathbb {E}\left[ X_{\lambda ^{+}}\right] \nonumber \\= & {} \left( 1 - \delta ^{b-b_{\lambda ^{+}}+1}\right) \left( X_{\lambda ^{+}} - 1\right) \end{aligned}$$

where \(b_\lambda \) is the bucket where a triangle \(\lambda \) is formed. Let \(\varLambda _u\) be the set of triangles containing node u. Then,

$$\begin{aligned} \mathrm {Var}\left[ Y_u\right]&= \sum _{\lambda _1 \in \varLambda _u} \sum _{\begin{array}{c} \lambda _2 \in \varLambda _u,\\ \lambda _1 \ne \lambda _2 \end{array}} \mathrm {Cov}\left[ Y_{\lambda _1},Y_{\lambda _2}\right] \\&= \sum _{\lambda _1 \in \varLambda _u^{+}} \sum _{\begin{array}{c} \lambda _2 \in \varLambda _u^{+},\\ \lambda _1 \ne \lambda _2 \end{array}} \mathrm {Cov}\left[ Y_{\lambda _1},Y_{\lambda _2}\right] \ (\because Y_{\lambda ^{-}} - \mathbb {E}\left[ Y_{\lambda ^{-}}\right] = 0) \\&= \sum _{\lambda _1 \in \varLambda _u^{+}} \sum _{\begin{array}{c} \lambda _2 \in \varLambda _u^{+},\\ \lambda _1 \ne \lambda _2 \end{array}} \left( 1 - \delta ^{b-b_{\lambda _1}+1}\right) \left( 1 - \delta ^{b-b_{\lambda _2}+1}\right) \mathrm {Cov}\left[ X_{\lambda _1},X_{\lambda _2}\right] . \end{aligned}$$

\(1 \le b_{\lambda ^{+}} \le b\) and \(0< \delta < 1\). Thus,

$$\begin{aligned} \left( 1 - \delta \right) ^2\mathrm {Var}\left[ X_u\right] \le \mathrm {Var}\left[ Y_u\right] \le \left( 1 - \delta ^b\right) ^2\mathrm {Var}\left[ X_u\right] . \end{aligned}$$

\(\square \)

Theorem 5

Let u(T) be the number of unique edges that have arrived at time T. If \(\delta ^b > 1 - \sqrt{1 - \frac{\mathbb {E}\left[ X_u\right] }{\alpha \mathrm {Var}\left[ X_u\right] }}\), the interval by \(\mathbb {E}\left[ Y_u\right] \pm \alpha \mathrm {Var}\left[ Y_u\right] \) is strictly included in that by \(\mathbb {E}\left[ X_u\right] \pm \alpha \mathrm {Var}\left[ X_u\right] \) for any \(\alpha \).

Proof

We first show that \(\mathbb {E}\left[ Y_u\right] - \alpha \mathrm {Var}\left[ Y_u\right] > \mathbb {E}\left[ X_u\right] - \alpha \mathrm {Var}\left[ X_u\right] \). By Lemmas 5 and 6,

$$\begin{aligned} \left( 1 - \delta \right) \varDelta _u - \left( 1 - \delta ^b\right) ^2\alpha \mathrm {Var}\left[ X_u\right] > \varDelta _u - \alpha \mathrm {Var}\left[ X_u\right] \end{aligned}$$

which is developed as follows.

$$\begin{aligned} 0 > \delta ^{2b} - 2\delta ^b + \frac{\delta \varDelta _u}{\alpha \mathrm {Var}\left[ X_u\right] }. \end{aligned}$$

\(0< \delta < 1\). Then, we obtain a sufficient condition as follows:

$$\begin{aligned} 1 - \frac{\varDelta _u}{\alpha \mathrm {Var}\left[ X_u\right] } > \left( \delta ^{b} - 1\right) ^2. \end{aligned}$$

Thus, we finally obtain the condition as follows:

$$\begin{aligned} \delta ^{b} > 1 - \sqrt{1 - \frac{\varDelta _u}{\alpha \mathrm {Var}\left[ X_u\right] }}. \end{aligned}$$

Due to the underestimation of FURL-0, \(\mathbb {E}\left[ Y_u\right] + \alpha \mathrm {Var}\left[ Y_u\right] < \mathbb {E}\left[ X_u\right] + \alpha \mathrm {Var}\left[ X_u\right] \) always holds, which completes the proof. \(\square \)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jung, M., Lim, Y., Lee, S. et al. FURL: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams. Data Min Knowl Disc 33, 1225–1253 (2019). https://doi.org/10.1007/s10618-019-00630-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-019-00630-6

Keywords

Navigation