Privacy and efficiency guaranteed social subgraph matching

  • Regular Paper
  • The VLDB Journal

Abstract

Due to the increasing cost of data storage and computation, more and more graphs (e.g., web graphs, social networks) are outsourced and analyzed in the cloud. However, there is growing concern about the privacy of these outsourced graphs at the hands of untrusted cloud providers. Unfortunately, simple label anonymization cannot protect nodes from being re-identified by an adversary who knows the graph structure. To address this issue, existing works adopt the k-automorphism model, which constructs \((k-1)\) symmetric vertices for each vertex. This model has two disadvantages. First, it significantly enlarges the graphs, which makes graph mining tasks such as subgraph matching extremely inefficient and sometimes infeasible even in the cloud. Second, it cannot protect the privacy of attributes in each node. In this paper, we propose a new privacy model, (k, t)-privacy, that combines the k-automorphism model for graph structure with the t-closeness privacy model for node label generalization. Besides offering a stronger privacy guarantee, the paper also optimizes the matching efficiency with (1) an approximate label generalization algorithm, TOGGLE, with a \((1+\epsilon )\) approximation ratio and (2) a new subgraph matching algorithm, PGP, that works on succinct k-automorphic graphs without decomposing the query graph.

Notes

  1. https://aws.amazon.com/compliance/hipaa-compliance/.

  2. https://www.oracle.com/database/graph/.

  3. One can argue that differential privacy [11] is more stringent than t-closeness as the former is defined regardless of the underlying dataset or a priori knowledge. However, it is infeasible in subgraph matching where exact matchings are desirable.

  4. We will directly use the notation \((v_1,v_2,\ldots v_i,\ldots v_j\ldots v_n)\) to denote the uniform distribution where each value is equally likely. For sorted numerical values \((v_1,v_2,\ldots v_i,\ldots v_j\ldots v_n)\), the ground distance of \(v_i\) and \(v_j\) is \(\frac{|i-j|}{n-1}\) [10].

  5. When n is small, we can enumerate all feasible subsets and relax the constraint \(y_{i,j} \in \{0,1\}\) to \(y_{i,j} \in [0,1]\). Then, we apply the Simplex method to solve this linear programming problem. Finally, we employ the Branch-and-Bound method to obtain the integer solution [27].

References

  1. Bi, F., Chang, L., Lin, X., Qin, L., Zhang, W.: Efficient subgraph matching by postponing cartesian products. In: SIGMOD, pp. 1199–1214 (2016)

  2. Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C.E., Hellerstein, J.: Graphlab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041 (2014)

  3. Chang, Z., Zou, L., Li, F.: Privacy preserving subgraph matching on large graphs in cloud. In: SIGMOD, pp. 199–213 (2016)

  4. Cao, N., Yang, Z., Wang, C., Ren, K., Lou, W.: Privacy-preserving query over encrypted graph-structured data in cloud computing. In: ICDCS, pp. 393–402 (2011)

  5. Hu, H., Xu, J., Chen, Q. et al.: Authenticating location-based services without compromising location privacy. In: SIGMOD, pp. 301–312 (2012)

  6. Xu, J., Yi, P., Choi, B. et al.: Privacy-preserving reachability query services for massive networks. In: CIKM, pp. 145–154 (2016)

  7. Available at: https://www.oracle.com/a/tech/docs/sg-oow2019-using-graph-analysis-and-fraud-detection-in-fintech-industry.pdf

  8. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 10(05), 557–570 (2002)

  9. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: ICDE, p. 24 (2006)

  10. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: ICDE, pp. 106–115 (2007)

  11. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC, pp. 265–284 (2006)

  12. Yuan, M., Chen, L., Philip, S.Y., Yu, T.: Protecting sensitive labels in social network data anonymization. TKDE 25(3), 633–647 (2013)

  13. Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: SIGMOD, pp. 93–106 (2008)

  14. Tai, C.-H., Tseng, P.-J., Philip, S.Y., Chen, M.-S.: Identity protection in sequential releases of dynamic networks. TKDE 26(3), 635–651 (2014)

  15. Zhou, B., Pei, J.: Preserving privacy in social networks against neighborhood attacks. In: ICDE, pp. 506–515 (2008)

  16. Hay, M., Miklau, G., Jensen, D., Towsley, D., Weis, P.: Resisting structural re-identification in anonymized social networks. PVLDB 1(1), 102–114 (2008)

  17. Zou, L., Chen, L., Özsu, M.T.: K-automorphism: a general framework for privacy preserving network publication. PVLDB 2(1), 946–957 (2009)

  18. Cheng, J., Fu, A.W.-c., Liu, J.: K-isomorphism: privacy preserving network publication against structural attacks. In: SIGMOD, pp. 459–470 (2010)

  19. Wu, W., Xiao, Y., Wang, W., He, Z., Wang, Z.: K-symmetry model for identity anonymization in social networks. In: EDBT, pp. 111–122 (2010)

  20. Gao, J., et al.: A privacy-preserving framework for subgraph pattern matching in cloud. In: DASFAA, pp. 307–322 (2018)

  21. Barnhart, C., Johnson, E.L., Nemhauser, G.L., Savelsbergh, M.W., Vance, P.H.: Branch-and-price: column generation for solving huge integer programs. Oper. Res. 46(3), 316–329 (1998)

  22. Li, X.-Y., Zhang, C., Jung, T., Qian, J., Chen, L.: Graph-based privacy-preserving data publication. In: INFOCOM, pp. 1–9 (2016)

  23. Hajian, S., Domingo-Ferrer, J., Farràs, O.: Generalization-based privacy preservation and discrimination prevention in data publishing and mining. DMKD 28(5–6), 1158–1188 (2014)

  24. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV 40(2), 99–121 (2000)

  25. Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: ICS, p. 29 (1995)

  26. He, H., Singh, A.K.: Graphs-at-a-time: query language and access methods for graph databases. In: SIGMOD, pp. 405–418 (2008)

  27. Lawler, E.L., Wood, D.E.: Branch-and-bound methods: a survey. Oper. Res. 14(4), 699–719 (1966)

  28. ILOG, I.: Cplex optimizer. https://www.ibm.com/cn-zh/marketplace/ibm-ilog-cplex (2012)

  29. Du, B., Zhang, S., Cao, N., Tong, H.: First: fast interactive attributed subgraph matching. In: SIGKDD. ACM, pp. 1447–1456 (2017)

  30. Qiao, M., Zhang, H., Cheng, H.: Subgraph matching: on compression and computation. PVLDB 11(2), 176–188 (2017)

  31. Yang, Z., Fu, A.W.-C., Liu, R.: Diversified top-k subgraph querying in a large graph. In: SIGMOD, pp. 1167–1182 (2016)

  32. Han, W.-S., Lee, J., Lee, J.-H.: Turboiso: towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD, pp. 337–348 (2013)

  33. Zhu, G., Lin, X., Zhu, K., Zhang, W., Yu, J.X.: Treespan: efficiently computing similarity all-matching. In: SIGMOD, pp. 529–540 (2012)

  34. Hay, M., Li, C., Miklau, G., Jensen, D.: Accurate estimation of the degree distribution of private networks. In: ICDM, pp. 169–178 (2009)

  35. Karwa, V., Raskhodnikova, S., Smith, A., Yaroslavtsev, G.: Private analysis of graph structure. PVLDB 4(11), 1146–1157 (2011)

  36. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: Private release of graph statistics using ladder functions. In: SIGMOD, pp. 731–745 (2015)

  37. Ye, Q., Hu, H., Au, M.H., Meng, X., Xiao, X.: LF-GDPR: graph metric estimation with local differential privacy. In: TKDE (2020). https://doi.org/10.1109/TKDE.2020.3047124

  38. Jiang, H., Pei, J., Yu, D. et al.: Applications of differential privacy in social network analysis: a survey. TKDE (2021)

  39. Ding, X., Sheng, S., Zhou, S. et al.: Differentially Private Triangle Counting in Large Graphs. TKDE (2021)

  40. Chen, S., Zhou, S.: Recursive mechanism: Towards node differential privacy and unrestricted joins. In: SIGMOD, pp. 653–664 (2013)

  41. Kasiviswanathan, S.P., Nissim, K., Raskhodnikova, S., Smith, A.: Analyzing graphs with node differential privacy. In: TCC, pp. 457–476 (2013)

  42. Day, W.Y., Li, N., Lyu, M.: Publishing graph degree distribution with node differential privacy. In: SIGMOD, pp. 123–138 (2016)

  43. Wang, Q., Zhang, Y., Lu, X., et al.: Real-time and spatio-temporal crowd-sourced social network data publishing with differential privacy. TDSC 15(4), 591–606 (2016)

  44. Jorgensen, Z., Yu, T., Cormode, G.: Publishing attributed social graphs with formal privacy guarantees. In: SIGMOD, pp. 107–122 (2016)

  45. Zheleva, E., Getoor, L.: Preserving the privacy of sensitive relationships in graph data. In: International Workshop on Privacy, Security, and Trust in KDD, pp. 153–171 (2007)

  46. Campan, A., Truta, T.M.: Data and structural k-anonymity in social networks. In: International Workshop on Privacy, Security, and Trust in KDD, pp. 33–54 (2008)

  47. Bhagat, S., Cormode, G., Krishnamurthy, B., Srivastava, D.: Class-based graph anonymization for social network data. PVLDB 2(1), 766–777 (2009)

  48. Fan, Z., Choi, B., Xu, J., Bhowmick, S.S.: Asymmetric structure-preserving subgraph queries for large graphs. In: ICDE, pp. 339–350 (2015)

  49. Gao, J., Yu, J.X., Jin, R., Zhou, J., Wang, T., Yang, D.: Neighborhood-privacy protected shortest distance computing in cloud. In: SIGMOD, pp. 409–420 (2011)

  50. Xie, D., Li, G., Yao, B., Wei, X., Xiao, X., Gao, Y., Guo, M.: Practical private shortest path computation based on oblivious storage. In: ICDE, pp. 361–372 (2016)

  51. Ma, J., Yao, B., Gao, X., et al.: Top-k critical vertices query on shortest path. TKDE 30(10), 1999–2012 (2018)

  52. Shen, M., Ma, B., Zhu, L., et al.: Cloud-based approximate constrained shortest distance queries over encrypted graphs with privacy protection. TIFS 13(4), 940–953 (2017)

  53. Ding, X., Wang, C., Choo, K.K.R., et al.: A novel privacy preserving framework for large scale graph data publishing. TKDE 33(2), 331–343 (2019)

  54. Jiang, J., Yi, P., Choi, B., et al.: Privacy-preserving reachability query services for massive networks. In: CIKM, pp. 145–154 (2016)

  55. Yang, S., Tang, S., Zhang, X.: Privacy-preserving k nearest neighbor query with authentication on road networks. JPDC 134, 25–36 (2019)

  56. Liang, H., Yuan, H.: On the complexity of t-closeness anonymization and related problems. In: DASFAA, pp. 331–345 (2013)

  57. Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB 1(1), 364–375 (2008)

  58. Garey, M.R., Johnson, D.S.: Computers and intractability. Freeman San Francisco, vol. 174 (1979)

  59. Schrenk, S., Finke, G., Cung, V.-D.: Two classical transportation problems revisited: pure constant fixed charges and the paradox. Math. Comput. Model. 54(9–10), 2306–2315 (2011)

  60. Žerovnik, J.: Heuristics for np-hard optimization problems-simpler is better!? Logist. Sustain. Transp. 6(1), 1–10 (2015)

  61. Nayak, K., Wang, X.S., Ioannidis, S., Weinsberg, U., Taft, N., Shi, E.: Graphsc: Parallel secure computation made easy. In: S&P, pp. 377–394 (2015)

Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant Nos: 62072390, U1936205, U1636205, 61572413, 62072125) and the Research Grants Council, Hong Kong SAR, China (Grant Nos: 15238116, 15222118, 15218919, 15203120).

Author information

Corresponding author

Correspondence to Kai Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Proofs

In this section, we present the formal proofs of theorems and lemmas.

A.1 Proof of Theorem 1

Given a data graph G, the graph outsourcing problem with t-closeness and k-automorphism is to compute an outsourced graph \(G'\) whose labels are required to satisfy t-closeness. Since t-closeness is a known NP-Hard problem [56] and can be reduced to our graph outsourcing problem, the graph outsourcing problem with t-closeness and k-automorphism is NP-Hard. In addition, our subgraph matching problem is NP-Hard since it involves subgraph isomorphism testing, which is a classical NP-Hard problem [32, 57, 58]. Overall, both the graph outsourcing problem with (k, t)-privacy and the subgraph matching problem on outsourced graphs are NP-Hard.

A.2 Proof of Lemma 1

Let \(l=(l_1,l_2,\ldots ,l_n)\) be ordered labels and \((1/n,1/n,\ldots ,1/n)\) their distribution masses. We define the \(\alpha \)-th alignment group (denoted by \(g_{\alpha }\)) as m consecutive labels in l, i.e., \(g_{\alpha } = (l_{(\alpha -1)m+1}, l_{(\alpha -1)m+2}, \ldots , l_{(\alpha -1)m+\beta },\ldots ,l_{\alpha m})\) (Fig. 21). In addition, let a feasible column \(y_j\) be the ordered labels \((e_1,\ldots ,e_\alpha , \ldots ,e_{n/m})\) with evenly distributed mass \((m/n,m/n,\ldots ,m/n)\). Since labels are ordered, according to [10], the minimal workload of \(EMD(l,y_j)\) is achieved by satisfying the elements of l sequentially, i.e., by moving distribution mass from \(y_j\) to l in order. In particular, as depicted in Fig. 21, \(e_\alpha \) should transport \(\frac{1}{n}\) distribution mass to each label in \(g_{\alpha }\). In short, each element \(e_\alpha \) in \(y_j\) is "aligned" with the \(\alpha \)-th alignment group: it transports \(\frac{1}{n}\) distribution mass to each element of that group.

Fig. 21 Alignment group
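The alignment described in Lemma 1 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it assumes m divides n, identifies each label with its position 1..n, and the names `alignment_groups` and `transport_plan` are hypothetical helpers, not part of the paper.

```python
from fractions import Fraction


def alignment_groups(n, m):
    """Split positions 1..n of the ordered labels into n/m groups of m consecutive positions."""
    if n % m != 0:
        raise ValueError("this sketch assumes m divides n")
    return [list(range((a - 1) * m + 1, a * m + 1)) for a in range(1, n // m + 1)]


def transport_plan(n, m):
    """Map (alpha, position) to the mass that e_alpha sends to the label at that position."""
    plan = {}
    for alpha, group in enumerate(alignment_groups(n, m), start=1):
        for pos in group:
            # Lemma 1: e_alpha sends 1/n to every label of its own alignment group.
            plan[(alpha, pos)] = Fraction(1, n)
    return plan


n, m = 12, 3
groups = alignment_groups(n, m)
plan = transport_plan(n, m)

# Total mass leaving each e_alpha and total mass arriving at each label.
sent = {a: sum(mass for (src, _), mass in plan.items() if src == a)
        for a in range(1, n // m + 1)}
received = {p: sum(mass for (_, dst), mass in plan.items() if dst == p)
            for p in range(1, n + 1)}

print(groups[1])  # [4, 5, 6]
print(sent[1])    # 1/4
```

Each \(e_\alpha \) sends out exactly its own mass m/n and every label receives exactly 1/n, so the plan is a feasible transport between the two distributions.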

A.3 Proof of Lemma 2

According to Lemma 1, each element \(e_\alpha \) in \(y_j\) is aligned with the \(\alpha \)-th alignment group \(g_\alpha \). Observe that the subscripts of the elements in \(g_\alpha \) are \((\alpha -1)m+1, (\alpha -1)m+2,\ldots ,(\alpha -1)m+\beta ,\ldots ,(\alpha -1)m+m\), respectively, and that the ground distance between \(e_\alpha \) and the \(\beta \)-th element of \(g_\alpha \) is \(\frac{|i-(\alpha -1)m-\beta |}{n-1}\), where i is the position of \(e_\alpha \) in l. Therefore, the total ground distance between \(e_\alpha \) and the \(\alpha \)-th alignment group is

$$\begin{aligned} \frac{1}{n-1}\sum _{\beta =1}^{m}\Big | i-(\alpha -1)m - \beta \Big | \end{aligned}$$

To bound this quantity, denoted \(Dist(e_\alpha ,g_\alpha )\), three cases should be considered:

  1. If \(i \le (\alpha -1)m+1\),

    $$\begin{aligned} \begin{aligned} Dist(e_\alpha ,g_\alpha )&= \frac{1}{n-1} \Big ( (\alpha -1)m^2 + \frac{(1+m)m}{2} - im \Big ) \\&= \frac{2n^2\alpha - 2(n^2/m)i - 2n^2 + (n/m+n)n}{2(n-1)(n/m)^2}. \end{aligned} \end{aligned}$$

    \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - n^2/m}{2(n-1)(n/m)^2}, \frac{2n^3/m - 2n^3/m^2 - n^2 + n^2/m}{2(n-1)(n/m)^2}]\).

  2. If \((\alpha -1)m+1 \le i \le (\alpha -1)m+m\), write \(i=(\alpha -1)m+\beta \) with \(1 \le \beta \le m\). Then

    $$\begin{aligned} \begin{aligned} Dist(e_\alpha ,g_\alpha )&= \frac{1}{n-1}\left( \sum _{\beta _1=1}^{\beta }(\beta -\beta _1)+\sum _{\beta _2=1}^{m-\beta }\beta _2 \right) \\&=\frac{2\beta ^2 - 2(1+m)\beta + m + m^2}{2(n-1)}. \end{aligned} \end{aligned}$$

    If m is odd, \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - n^2/m^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\); otherwise, \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\).

  3. If \(i \ge (\alpha -1)m+m\),

    $$\begin{aligned} \begin{aligned} Dist(e_\alpha ,g_\alpha )&= \frac{1}{n-1}\Big ( im - (\alpha -1)m^2 - \frac{(1+m)m}{2}\Big ) \\&= \frac{2(n^2/m)i - 2n^2\alpha + 2n^2 -(n/m+n)n}{2(n-1)(n/m)^2}. \end{aligned} \end{aligned}$$

    \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - n^2/m}{2(n-1)(n/m)^2}, \frac{2n^3/m - 2n^3/m^2 - n^2 + n^2/m}{2(n-1)(n/m)^2}]\).

Therefore, for all \(i \in [(\alpha -1)m+1, (\alpha -1)m+m ]\), \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - (n/m)^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\), i.e., \([\frac{m^2-1}{4(n-1)}, \frac{m^2-m}{2(n-1)}]\), if m is odd, and \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\), i.e., \([\frac{m^2}{4(n-1)}, \frac{m^2-m}{2(n-1)}]\), if m is even.
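As a sanity check on case (2), the following sketch compares the definition of \(Dist(e_\alpha ,g_\alpha )\) as a sum of ground distances with the closed form \(\frac{2\beta ^2 - 2(1+m)\beta + m + m^2}{2(n-1)}\) obtained by evaluating the two sums, where \(i=(\alpha -1)m+\beta \). The function names and the sample values of n, m and \(\alpha \) are assumptions made for illustration only.

```python
from fractions import Fraction


def dist_brute(i, alpha, m, n):
    """Sum of ground distances |i - (alpha-1)m - beta| / (n-1) over beta = 1..m."""
    total = sum(abs(i - (alpha - 1) * m - beta) for beta in range(1, m + 1))
    return Fraction(total, n - 1)


def dist_closed_inside(beta, m, n):
    """Closed form for case (2), where e_alpha sits at offset beta inside its own group."""
    return Fraction(2 * beta * beta - 2 * (1 + m) * beta + m + m * m, 2 * (n - 1))


def inside_range(alpha, m, n):
    """Minimum and maximum of Dist over all positions i inside the alpha-th group."""
    values = [dist_brute((alpha - 1) * m + beta, alpha, m, n) for beta in range(1, m + 1)]
    return min(values), max(values)


n, alpha = 20, 2
for m in (4, 5):
    low, high = inside_range(alpha, m, n)
    print(m, low, high)
```

For odd m the minimum is reached at the middle offset \(\beta =(m+1)/2\), for even m at \(\beta =m/2\), and the maximum is reached at either end of the group.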

A.4 Proof of Theorem 2

Lemma 2 proved that for all \(i \in [(\alpha -1)m+1, (\alpha -1)m+m ]\), \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - (n/m)^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\) if m is odd, and \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2}{4(n-1)(n/m)^2}, \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}]\) otherwise. Each element \(e_\alpha \) of the column \(y_j\) generated by the initial solution is selected from the i-th position of l, where \(i \in [(\alpha -1)m+1,(\alpha -1)m+m ]\). In addition, Lemma 1 showed that each element \(e_\alpha \) in \(y_j\) transports 1/n distribution mass to each element in the \(\alpha \)-th alignment group. Based on these two observations, we derive that \(EMD(l,y_j) \le \sum _{\alpha =1}^{n/m} \frac{n^2 - n^2/m}{2(n-1)(n/m)^2}\times \frac{1}{n} = \frac{mn^2-n^2}{2(n-1)n^2} = \frac{m-1}{2(n-1)}\). Therefore, when \(t \ge \frac{m-1}{2(n-1)}\), the column \(y_j\) satisfies t-closeness. Similarly, each column generated in the subproblem also satisfies t-closeness. In the same way, one can prove that the EMD between l and any other column is bounded by \(\frac{2mn-2n-m^2+m}{2(n-1)m}\).
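The bound \(\frac{m-1}{2(n-1)}\) can be checked numerically on a small instance. The sketch below is an illustration, not the paper's implementation: it uses the cumulative-sum form of the EMD for ordered values from [10], assumes m divides n, and enumerates every column that picks exactly one position from each alignment group. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def ordered_emd(p, q):
    """EMD under the ordered ground distance |i-j|/(n-1), via running sums of p - q."""
    running, total = Fraction(0), Fraction(0)
    for pi, qi in zip(p, q):
        running += pi - qi
        total += abs(running)
    return total / (len(p) - 1)


def column_distribution(n, m, offsets):
    """Put mass m/n at offset offsets[alpha-1] (1..m) inside each alignment group."""
    dist = [Fraction(0)] * n
    for alpha, beta in enumerate(offsets, start=1):
        dist[(alpha - 1) * m + beta - 1] = Fraction(m, n)
    return dist


def worst_case_emd(n, m):
    """Largest EMD to the uniform distribution over all one-per-group columns."""
    uniform = [Fraction(1, n)] * n
    return max(
        ordered_emd(column_distribution(n, m, offsets), uniform)
        for offsets in product(range(1, m + 1), repeat=n // m)
    )


n, m = 12, 3
print(worst_case_emd(n, m), Fraction(m - 1, 2 * (n - 1)))  # 1/11 1/11
```

On this instance the worst case meets the bound exactly; it is attained when every selected element sits at the first or last position of its alignment group.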

A.5 Proof of Lemma 3

Let \(l=(l_1,l_2,\ldots ,l_n)\) be n labels ordered by their values. Since the frequencies of labels roughly obey Zipf's law [3], the frequencies of l can be represented by \((\frac{a}{1}, \frac{a}{2},\ldots , \frac{a}{n})\), where a is the normalizing constant \(a \approx \frac{1}{\ln (n)+0.5772+1/(2n)}\) and 0.5772 approximates the Euler–Mascheroni constant. When \(t \ge \frac{m-1}{2(n-1)}\), sub-optimal TOGGLE generates the initial partition \(\{y_j|j\in [1,m]\}\), where \(y_{i,j}=1\) if i lies in \(\{j,j+m,\ldots ,j+(n/m-1)m\}\) and \(y_{i,j}=0\) otherwise. Let the sum of label frequencies of \(y_j\) be

$$\begin{aligned} s_j = \frac{a}{j}+ \frac{a}{j+m}+\ldots + \frac{a}{j+(n/m-1)m}, \end{aligned}$$

the cost of \(y_j\) is then \(s^2_j\). Therefore, the total cost of the initial solution is \(s^2_1+s^2_2+\ldots +s^2_m\), where \(s_1+s_2+\ldots +s_m =1\) and \(s_1> s_2> \ldots >s_m\). Since

$$\begin{aligned} \begin{aligned} s_1&= \frac{a}{1}+ \frac{a}{1+m}+\ldots + \frac{a}{1+(n/m-1)m} \\&\le \frac{a}{1}+ \frac{a}{m}+\ldots + \frac{a}{(n/m-1)m} \\&\le \frac{1}{m} + \frac{(m^2-m+2)a}{m^2+m}, \end{aligned} \end{aligned}$$

we can derive that \(\frac{1}{m} + \frac{(m^2-m+2)a}{m^2+m} \ge s_1>s_2>\ldots >s_m\), and

$$\begin{aligned} \begin{aligned} \sum _{i=1}^{m}s_i^2&= \left( \sum _{i=1}^{m}{s_i} \right) ^2-s_1\left( \sum _{i \ne 1}^{m}s_i \right) -s_2\left( \sum _{i \ne 2}^{m}s_i \right) -\ldots -s_m\left( \sum _{i \ne m}^{m}s_i \right) \\&\le \frac{1}{m} + \frac{(m^2-m+2)a}{m^2+m}. \end{aligned} \end{aligned}$$

Therefore, the model cost of the initial solution under the t-closeness constraints is at most \( \frac{1}{m} + \frac{(m^2-m+2)a}{m^2+m}\). To estimate the approximation ratio \(R_1\) of our model cost to the exact model cost, we first relax the t-closeness constraint to find the minimum model cost. Formally, for any \(\{X_i\}\) subject to \(\sum _{i=1}^{m}X_i=1\), we need to estimate the lower bound of \(\sum _{i=1}^{m}X_i^2\). According to the Cauchy–Schwarz inequality, for \(X_i,Y_i \in {\mathbb {R}}\), \(\big (\sum _{i=1}^{m}X_iY_i\big )^2 \le \big (\sum _{i=1}^{m}X_i^2\big )\big (\sum _{i=1}^{m}Y_i^2\big )\). Letting \(Y_i=1\), we derive that \(\frac{1}{m} \le \sum _{i=1}^{m}X_i^2\). Therefore, the minimum model cost is no less than \(\frac{1}{m}\). The approximation ratio is \(R_1 \le \frac{ 1/m + (m^2-m+2)a/(m^2+m)}{1/m} = 1 + \frac{(m^2-m+2)a}{m+1}\). The approximation is good since this ratio is approximately linear in \(m \cdot a\).
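The quantities in this proof are easy to evaluate on a toy instance. The sketch below is an illustration under stated assumptions: it takes the label frequencies to be exactly \(a/i\) with a chosen so that they sum to 1 (that is, a is the reciprocal of the n-th harmonic number), assumes m divides n, and uses hypothetical helper names.

```python
from fractions import Fraction


def zipf_frequencies(n):
    """Frequencies a/1, a/2, ..., a/n, normalized so that they sum to 1."""
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    a = 1 / harmonic
    return a, [a / i for i in range(1, n + 1)]


def initial_partition_sums(n, m):
    """s_j = total frequency of the labels at positions j, j+m, ..., j+(n/m-1)m."""
    a, freq = zipf_frequencies(n)
    sums = [sum(freq[i - 1] for i in range(j, n + 1, m)) for j in range(1, m + 1)]
    return a, sums


n, m = 12, 3
a, s = initial_partition_sums(n, m)
cost = sum(x * x for x in s)                                   # model cost of the partition
lower = Fraction(1, m)                                         # Cauchy-Schwarz lower bound
upper = Fraction(1, m) + Fraction(m * m - m + 2, m * m + m) * a  # bound stated in the proof

print(float(cost), float(lower), float(upper))
```

On this instance the cost lies between the Cauchy–Schwarz lower bound 1/m and the upper bound stated in the proof; the sketch does not attempt to verify the upper bound for arbitrary n and m.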

A.6 Proof of Lemma 4

From Lemma 3, the sum of label frequencies of \(y_j\) is \(s_j = \frac{a}{j}+ \frac{a}{j+m}+\ldots + \frac{a}{j+(n/m-1)m}\), where \(s_1+s_2+\ldots +s_m =1\) and \(s_1> s_2> \ldots >s_m\). If we denote the first dual solution of the master problem as \(\mu = [s^2_1,s^2_2,\ldots ,s^2_m,0,0,\ldots ,0]\), the objective values of the original subproblem and the reduced problem can be formulated as \(J_2=\min (c(y_j)-\mu y_j)\) s.t. \(EMD(l,y_j)\le t\), and \( J_{2}' = \min (c(y_j)-\mu y_j)\) s.t. the QKP constraints, respectively. We can then derive that

$$ \begin{aligned} \frac{J_2}{J_{2}\prime }&\le \frac{\sum _{i=1}^{n}s^2_iy_{i,j}- (\sum _{i=1}^{n} \frac{ay_{i,j}}{i})^2 }{ s^2_1 - (\frac{a}{1}+\sum _{i=2}^{n/m}\frac{a}{im})^2 } \\&\le \frac{\sum _{i=1}^{n}s^2_iy_{i,j} }{ s^2_1 - (\frac{a}{1}+\sum _{i=2}^{n/m}\frac{a}{im})^2 } \le \frac{\sum _{i=1}^{n}s^2_iy_{i,j} }{ 2s_1 \times (\frac{a}{1}+\sum _{i=2}^{n/m}\frac{a}{im})} \\&\le \frac{\sum _{i=1}^{n}s^2_iy_{i,j} }{ 2s_1 \times s_m } \le \frac{\sum _{i=1}^{n/m}s^2_i }{ 2s_1 \times s_m } = \frac{\sum _{i=1}^{n/m}s_i\times s_1 }{ 2s_1 \times s_m } \\&= \frac{1}{2} \big ( \frac{s_1}{s_1}\frac{s_1}{s_m} + \frac{s_2}{s_1}\frac{s_2}{s_m} +\ldots + \frac{s_{n/m}}{s_1}\frac{s_{n/m}}{s_m}\big )\\&\le \frac{1}{2} \big ( \frac{s_1}{s_1}\frac{s_1}{s_m} + \frac{s_2}{s_1}\frac{s_1}{s_m} +\ldots + \frac{s_{n/m}}{s_1}\frac{s_1}{s_m}\big )\\&\le \frac{1}{2} \big ( \frac{s_1}{s_1} + \frac{s_2}{s_1} +\ldots + \frac{s_{n/m}}{s_1}\big ) \frac{s_1}{s_m} \\&\le \frac{1}{2} \big ( \frac{s_1+s_2+\ldots +s_{n/m}}{s_1}\big ) \frac{s_1}{s_m} \le \frac{1}{2} \frac{1}{s_1}\frac{s_1}{s_m} \le \frac{m}{4a}. \end{aligned} $$

Therefore, the approximation ratio of \( J_{2}'\) to \(J_2\) is no less than 4a/m, where a is the normalizing constant of the label frequencies defined in Lemma 3.

A.7 Proof of Theorem 3

Let the optimal solution to the original problem be \(\textsc {opt}\) and the initial solution be \(R_1\times \textsc {opt}\). If the first reduced cost of the column generation method is \(J_2\), then \(\textsc {opt}= R_1\times \textsc {opt} - \gamma \times J_2\), where \(\gamma \ge 1\) and \(\gamma = (R_1-1)\textsc {opt}/J_2\). Similarly, if the first reduced cost of the sub-optimal method is \( J_{2}'\), we can derive the objective value \(X = R_1\times \textsc {opt} - \gamma ' \times J_{2}'\).

On the basis of Lemmas 3 and 4, we can prove that if \( \gamma \le \gamma '\), the approximation ratio is \(\textsc {opt}/X \ge \textsc {opt}/(R_1\times \textsc {opt} - \gamma \times J_{2}') = \textsc {opt}/(R_1\times \textsc {opt} - ((R_1-1)\textsc {opt}/J_2) \times J_{2}') \ge 1/(1+ (m^3-5m^2+6m-8)a/(m^2+m))\). Otherwise, \(\textsc {opt}/X = \textsc {opt}/(R_1\times \textsc {opt} - \gamma ' \times J_{2}')\ge \textsc {opt}/(R_1\times \textsc {opt}) \ge 1/(1 + (m^2-m+2)a/(m+1))\). Therefore, \(\textsc {opt} \le X \le (1+ (m^3-5m^2+6m-8)a/(m^2+m))\textsc {opt}\) in the first case, and \(\textsc {opt} \le X \le (1 + (m^2-m+2)a/(m+1))\textsc {opt} \approx (1+0.2m)\textsc {opt}\) in the second.
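To get a feel for the two factors in Theorem 3, they can be evaluated for sample values of n and m. The sketch below is only illustrative: it assumes a is the reciprocal of the n-th harmonic number (the normalizing constant of the Zipf frequencies in Lemma 3), and the function names and sample values are hypothetical.

```python
import math


def zipf_constant(n):
    """Normalizing constant a with a/1 + a/2 + ... + a/n = 1."""
    return 1.0 / sum(1.0 / i for i in range(1, n + 1))


def zipf_constant_approx(n):
    """Closed-form approximation of a used in Lemma 3."""
    return 1.0 / (math.log(n) + 0.5772 + 1.0 / (2 * n))


def theorem3_factors(n, m):
    """Return the two multiplicative factors on opt that appear in Theorem 3."""
    a = zipf_constant(n)
    first = 1 + (m**3 - 5 * m**2 + 6 * m - 8) * a / (m**2 + m)
    second = 1 + (m**2 - m + 2) * a / (m + 1)
    return first, second


n, m = 100, 5
first, second = theorem3_factors(n, m)
print(round(zipf_constant(n), 4), round(zipf_constant_approx(n), 4))
print(round(first, 4), round(second, 4))
```

For n = 100 the constant a is close to 0.19, which is in line with the \((1+0.2m)\) approximation quoted for the second factor when m is large.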

Huang, K., Hu, H., Zhou, S. et al. Privacy and efficiency guaranteed social subgraph matching. The VLDB Journal 31, 581–602 (2022). https://doi.org/10.1007/s00778-021-00706-0
