Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3448016.3452828acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Dynamic Structural Clustering on Graphs

Published: 18 June 2021 Publication History

Abstract

\em Structural Clustering ($\strclu$) is one of the most popular graph clustering paradigms. In this paper, we consider $\strclu$ under Jaccard similarity on a dynamic graph, G = (V, E), subject to edge insertions and deletions (updates). The goal is to maintain certain information under updates, so that the strclu clustering result on~G can be retrieved in O(|V| + |E|)$ time, upon request. The state-of-the-art worst-case cost is~O(|V|) per update; we improve this update-time bound \em significantly with the ρ-approximate notion. Specifically, for a specified failure probability, δ^*, and \em every sequence of~M updates (no need to know M's value in advance), our algorithm, $\dynelm$, achieves~O(?og^2 |V| + og |V| \cdot ?og \fracM ?^* )$ amortized cost for each update, \em at all times in linear space. Moreover, $\dynelm$ provides a provable "sandwich'' guarantee on the clustering quality at all times after each update with probability at least 1 - ^*. We further develop dynelm into our ultimate algorithm, dynstr, which also supports \em cluster-group-by queries. Given Q \subseteq V, this puts the non-empty intersection of Q and each strclu cluster into a distinct group. dynstr not only achieves all the guarantees of dynelm, but also runs \em cluster-group-by queries in~O(|Q|\cdot og |V|) time. We demonstrate the performance of our algorithms via extensive experiments, on 15 real datasets. Experimental results confirm that our algorithms are up to three orders of magnitude more efficient than state-of-the-art competitors, and still provide quality structural clustering results.

Supplementary Material

MP4 File (3448016.3452828.mp4)
Structural Clustering (StrClu) is one of the most popular graph clustering paradigms. In this paper, we consider StrClu under the Jaccard similarity on a dynamic graph, G = < V, E >, subject to edge insertions and deletions. The goal is to maintain certain information under updates, so that the StrClu clustering result on G can be retrieved in O(|V| + |E|) time, upon request. The state-of-the-art worst-case update cost is O(|V|). We improve this bound significantly with the \rho-approximate notion. Specifically, for any sequence of updates that satisfies: for every vertex u in V, the frequency of deletions incident on u is at most a constant fraction of that of the insertions incident on u, our algorithm, DynELM, achieves ~O(1) amortized cost for each update, with linear space consumption, where the notation ~O( ) hides a poly-logarithmic factor in the complexity. Moreover, DynELM provides a provable "sandwich" guarantee on the clustering quality with high probability. We further develop DynELM into our ultimate algorithm, DynStrClu, which also supports cluster-group-by queries. Given an arbitrary subset Q of V, this puts the non-empty intersection of Q and each StrClu cluster into a distinct group. DynStrClu not only achieves all the guarantees of DynELM, but also runs cluster-group-by queries in ~O(|Q|) time. We demonstrate the performance of our algorithms via extensive experiments, on 14 real datasets. Experimental results confirm that both our algorithms are up to three orders of magnitude more efficient than state-of-the-art competitors, and still provide quality structural clustering results.

References

[1]
Nikhil Bansal, Avrim Blum, and Shuchi Chawla. 2004. Correlation Clustering. Machine Learning, Vol. 56, 1--3 (2004), 89--113.
[2]
James C Bezdek and Nikhil R Pal. 1998. Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 28, 3 (1998), 301--315.
[3]
Lijun Chang, Wei Li, Lu Qin, Wenjie Zhang, and Shiyu Yang. 2017. pSCAN: Fast and Exact Structural Graph Clustering. IEEE Trans. Knowl. Data Eng., Vol. 29, 2 (2017), 387--401.
[4]
Sudarshan S Chawathe. 2019. Clustering blockchain data. In Clustering Methods for Big Data Analytics. Springer, 43--72.
[5]
Graham Cormode, S. Muthukrishnan, and Ke Yi. 2011. Algorithms for Distributed Functional Monitoring. ACM Trans. Algorithms, Vol. 7, 2, Article 21 (March 2011), bibinfonumpages21:1--21:20 pages.
[6]
Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon. 2001. A Min-max Cut Algorithm for Graph Partitioning and Data Clustering. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA. 107--114.
[7]
Yijun Ding, Minjun Chen, Zhichao Liu, Don Ding, Yanbin Ye, Min Zhang, Reagan Kelly, Li Guo, Zhenqiang Su, Stephen C Harris, et almbox. 2012. atBioNet--an integrated network analysis tool for genomics and biomarker discovery. BMC genomics, Vol. 13, 1 (2012), 325.
[8]
Junhao Gan, David F. Gleich, Nate Veldt, Anthony Wirth, and Xin Zhang. 2019. Graph Clustering in All Parameter Regimes. CoRR, Vol. abs/1910.06435 (2019). http://arxiv.org/abs/1910.06435
[9]
Junhao Gan and Yufei Tao. 2017a. Dynamic Density Based Clustering. In Proceedings of ACM Management of Data (SIGMOD) . 1493--1507.
[10]
Junhao Gan and Yufei Tao. 2017b. On the Hardness and Approximation of Euclidean DBSCAN . ACM Trans. Database Syst., Vol. 42, 3 (2017), 14:1--14:45.
[11]
Junhao Gan and Yufei Tao. 2018. Fast Euclidean OPTICS with Bounded Precision in Low Dimensional Space. In Proceedings of ACM Management of Data (SIGMOD). 1067--1082.
[12]
David F. Gleich, Nate Veldt, and Anthony Wirth. 2018. Correlation Clustering Generalized. In 29th International Symposium on Algorithms and Computation, ISAAC 2018, December 16--19, 2018, Jiaoxi, Yilan, Taiwan . 44:1--44:13.
[13]
Stephan Gü nnemann, Ines F"a rber, Brigitte Boden, and Thomas Seidl. 2014. GAMer: a synthesis of subspace clustering and dense subgraph mining. Knowl. Inf. Syst., Vol. 40, 2 (2014), 243--278.
[14]
Wassily Hoeffding. 1994. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding. Springer, 409--426.
[15]
Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. 2001. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. J. ACM, Vol. 48, 4 (2001), 723--760.
[16]
Zengfeng Huang, Ke Yi, and Qin Zhang. 2019. Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks. Algorithmica, Vol. 81, 6 (2019), 2222--2243.
[17]
Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of classification, Vol. 2, 1 (1985), 193--218.
[18]
Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27--29, 2006. 289--300.
[19]
Ludmila I Kuncheva, Stefan Todorov Hadjitodorov, and Ludmila P Todorova. 2006. Experimental comparison of cluster ensemble methods. In 2006 9th International Conference on Information Fusion. IEEE, 1--7.
[20]
Ludmila I Kuncheva and Dmitry P Vetrov. 2006. Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE transactions on pattern analysis and machine intelligence, Vol. 28, 11 (2006), 1798--1808.
[21]
Sungsu Lim, Seungwoo Ryu, Sejeong Kwon, Kyomin Jung, and Jae-Gil Lee. 2014. LinkSCAN*: Overlapping community detection using the link-space transformation. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014. 292--303.
[22]
Mutlu Mete, Fusheng Tang, Xiaowei Xu, and Nurcan Yuruk. 2008. A structural approach for finding functional modules from large biological networks. In Bmc Bioinformatics, Vol. 9. Springer, S19.
[23]
Glenn W Milligan and Martha C Cooper. 1986. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate behavioral research, Vol. 21, 4 (1986), 441--458.
[24]
Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. 2003. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning, Vol. 52, 1--2 (2003), 91--118.
[25]
Flavia Moser, Recep Colak, Arash Rafiey, and Martin Ester. 2009. Mining Cohesive Patterns from Graphs with Feature Vectors. In Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA. 593--604.
[26]
Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. 2005. Uncovering the overlapping community structure of complex networks in nature and society. nature, Vol. 435, 7043 (2005), 814.
[27]
Symeon Papadopoulos, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. 2012. Community detection in social media. Data Mining and Knowledge Discovery, Vol. 24, 3 (2012), 515--554.
[28]
S Papadopoulos, C Zigkolis, Y Kompatsiaris, and A Vakali. 2010. Cluster-based landmark and event detection on tagged photo collections. IEEE, Vol. 99 (2010), 1--1.
[29]
Vishnu Sankar, Balaraman Ravindran, and S. Shivashankar. 2015. CEIL: A Scalable, Resolution Limit Free Approach for Detecting Communities in Large Networks. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25--31, 2015. 2097--2103.
[30]
Satu Elisa Schaeffer. 2007. Graph clustering. Computer science review, Vol. 1, 1 (2007), 27--64.
[31]
Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka. 2015. SCAN
[32]
: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs. PVLDB, Vol. 8, 11 (2015), 1178--1189.
[33]
Mikkel Thorup. 2000. Near-optimal fully-dynamic graph connectivity. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21--23, 2000, Portland, OR, USA. 343--350.
[34]
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, Vol. 11 (2010), 2837--2854.
[35]
Lu Wang, Yanghua Xiao, Bin Shao, and Haixun Wang. 2014. How to partition a billion-node graph. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014 . 568--579.
[36]
Dong Wen, Lu Qin, Ying Zhang, Lijun Chang, and Xuemin Lin. 2019. Efficient structural graph clustering: an index-based approach. VLDB J., Vol. 28, 3 (2019), 377--399.
[37]
Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. 2007. SCAN: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12--15, 2007. 824--833.
[38]
Nurcan Yuruk, Xiaowei Xu, and Thomas AJ Schweiger. 2008a. On Structural Analysis of Large Networks. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008). IEEE, 143--143.
[39]
Nurcan Yuruk, Xiaowei Xu, and Thomas A. J. Schweiger. 2008b. On Structural Analysis of Large Networks. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008). 143.

Cited By

View all
  • (2024)A Similarity-based Approach for Efficient Large Quasi-clique DetectionProceedings of the ACM Web Conference 202410.1145/3589334.3645374(401-409)Online publication date: 13-May-2024
  • (2024)Denoising High-Order Graph Clustering2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00241(3111-3124)Online publication date: 13-May-2024
  • (2024)A Novel Method for Vertex Clustering in Dynamic NetworksComplex Networks & Their Applications XII10.1007/978-3-031-53499-7_36(445-456)Online publication date: 29-Feb-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. algorithms
  2. dynamic graphs
  3. structural clustering

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)91
  • Downloads (Last 6 weeks)11
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Similarity-based Approach for Efficient Large Quasi-clique DetectionProceedings of the ACM Web Conference 202410.1145/3589334.3645374(401-409)Online publication date: 13-May-2024
  • (2024)Denoising High-Order Graph Clustering2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00241(3111-3124)Online publication date: 13-May-2024
  • (2024)A Novel Method for Vertex Clustering in Dynamic NetworksComplex Networks & Their Applications XII10.1007/978-3-031-53499-7_36(445-456)Online publication date: 29-Feb-2024
  • (2023)Scaling Up Structural Clustering to Large Probabilistic Graphs Using Lyapunov Central Limit TheoremProceedings of the VLDB Endowment10.14778/3611479.361151616:11(3165-3177)Online publication date: 1-Jul-2023
  • (2023)Scalable Approximate Butterfly and Bi-triangle Counting for Large Bipartite NetworksProceedings of the ACM on Management of Data10.1145/36267531:4(1-26)Online publication date: 12-Dec-2023
  • (2023)An Efficient Algorithm for Distance-based Structural Graph ClusteringProceedings of the ACM on Management of Data10.1145/35887251:1(1-25)Online publication date: 30-May-2023
  • (2023)Datasets, tasks, and training methods for large-scale hypergraph learningData Mining and Knowledge Discovery10.1007/s10618-023-00952-637:6(2216-2254)Online publication date: 26-Jul-2023
  • (2022)Effective indexing for dynamic structural graph clusteringProceedings of the VLDB Endowment10.14778/3551793.355184015:11(2908-2920)Online publication date: 1-Jul-2022
  • (2022)Approximate Range ThresholdingProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526123(1108-1121)Online publication date: 10-Jun-2022
  • (2022)Index-based Structural Clustering on Directed Graphs2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00257(2831-2844)Online publication date: May-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media