Abstract
A common scheme for discovering overlapping communities in a network is to use a seeding process followed by an expansion process. Most seeding methods are either too complex to scale to large networks or too simple to select high-quality seeds. Additionally, the non-principled functions used by most expansion methods lead to poor performances when applied to diverse networks. This paper proposes a new method that transforms a network into a corpus. Each edge is treated as a document, and all the network nodes are treated as terms of the corpus. We propose an effective seeding method that selects seeds as a training set, and a principled expansion method based on semi-supervised learning that classifies the edges. We compared our new algorithm with four other community detection algorithms on a wide range of synthetic and empirical networks. Our experimental results show that the new algorithm significantly improved the clustering performance in most cases. Furthermore, the time complexity of the new algorithm is linear with respect to the number of edges, which means that the technique can be scaled to large networks.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Borgatti [13] proposed the Jaccard similarity matrix, but the Jaccard matrix in this paper is different from Borgatti’s Jaccard similarity matrix. Using the definitions given by Borgatti [13], the Jaccard matrix is a 2-mode matrix derived from a 1-mode adjacent matrix. Borgatti’s Jaccard similarity matrix is a 1-mode matrix derived from a 2-mode women-by-events matrix.
All experiments were conducted on a workstation that has eight Intel Xeon 2.27GHz cores and 12G RAM. ITEM and SMC were parallelized running on eight cores using openMP. The other algorithms were not parallelized and ran on one core.
References
Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821
Borgs C, Chayes J, Mahdian M, Saberi A (2004) Exploring the community structure of newsgroups. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 783–787, 2004
Lee C, Reid F, McDaid A, Hurley N (2010) Detecting highly overlapping community structure by greedy clique expansion. In: SNA-KDD’10: Proceedings of the 4th Workshop on Social Network Mining and Analysis, 2010
Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015
Lee C, Reid F, McDaid A, Hurley N (2011) Seeding for pervasively overlapping communities. Phys Rev E 83(6):066107
Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117
Newman ME, Leicht EA (2007) Mixture models and exploratory analysis in networks. Proc Natl Acad Sci 104(23):9564
Shen H, Cheng X, Cai K, Hu MB (2009) Detect overlapping and hierarchical community structure in networks. Phys A Stat Mech Appl 388(8):1706
Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575
Baumes J, Goldberg MK, Krishnamoorthy MS, Magdon-Ismail M, Preston N (2005) Finding communities by clustering a graph into overlapping subgraphs. IADIS AC 5:97
Baumes J, Goldberg M, Magdon-Ismail M (2005) Intelligence and security informatics. Springer, New York, pp 27–36
Yang J, Leskovec J (2012) Proceedings of the ACM SIGKDD Workshop on mining data semantics, ACM, p 3, 2012
Borgatti SP (2012) Computational complexity—theory, techniques, and applications. In: Meyers RA (ed). Springer, New York, pp 2912–2924
Berry MW, Castellanos M (2004) Survey of text mining. Springer, New York
Koller D, Sahami M (1997) Proceedings of ICML-97, 14th International Conference on machine learning. Morgan Kaufmann Publishers, Burlington, pp 170–178
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503
Zhu X (2006) Semi-supervised learning literature survey. Comput Sci Univ Wis Madison 2:3
Jiang J, Yan X, Yu Z, Guo J, Tian W (2014) A Chinese expert disambiguation method based on semi-supervised graph clustering. Int J Mac Learn Cybern :1–8 (2014)
Maulik U, Chakraborty D (2012) A novel semisupervised SVM for pixel classification of remote sensing imagery. Int J Mac Learn Cybern 3(3):247
Chen WJ, Shao YH, Hong N (2014) Laplacian smooth twin support vector machine for semi-supervised classification. Int J Mac Learn Cybern 5(3):459
Tanha J, van Someren M, Afsarmanesh H (2015) Semi-supervised self-training for decision tree classifiers. Int J Mac Learn Cybern :1–16
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103
Shang C, Li M, Feng S, Jiang Q, Fan J (2013) Feature selection via maximizing global information gain for text classification. Knowl Based Syst 54:298
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, ACM, pp 380–388, 2002
Manku GS, Jain A, Das Sarma A (2007) Detecting near-duplicates for web crawling. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 141–150, 2007
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 89–98, 2003
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. ICML 99:258–267
Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118
Traud AL, Kelsic ED, Mucha PJ, Porter MA (2011) Comparing community structure to characteristics in online collegiate social networks. SIAM Rev 53(3):526
Traud AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Phys A Stat Mech Appl 391(16):4165
Lee C, Cunningham P (2014) Community detection: effective evaluation on large social networks. J Comp Netw 2(1):19
Gargi U, Lu W, Mirrokni VS, Yoon S (2011) Large-Scale Community Detection on YouTube for Topic Discovery and Exploration. ICWSM
Subbian K, Aggarwal CC, Srivastava J, Yu PS (2013) Community Detection with Prior Knowledge. In: Proceedings of the 2013 SIAM International Conference on data mining, SIAM, pp 405–413, 2013
Yang T, Jin R, Chi Y, Zhu S (2009) Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 927–936, 2009
Gopalan PK, Blei DM (2013) Efficient discovery of overlapping communities in massive networks. Proc Natl Acad Sci 110(36):14534
Andersen R, Gleich DF, Mirrokni V (2012) Overlapping clusters for distributed computation. In: Proceedings of the fifth ACM international conference on Web search and data mining, ACM, pp 273–282, 2012
Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM , pp 597–605, 2012
Xie J, Kelley S, Szymanski EK (2013) Overlapping Community Detection in Networks: The State-of-the-art and Comparative Study. ACM Comput Surv 45(4):43. doi:10.1145/2501654.2501657
Xie J, Szymanski BK, Liu X (2011) Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, IEEE, pp 344–349, 2011
Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018
Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PloS One 6(4):e18961
Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814
Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. Phys Rev E 84(3):036103
Chapelle O, Schölkopf B, Zien A (2006) Risks of semi-supervisedl earning: how unlabeled data can degrade performance of generative classifiers, in semi-supervised learning. MIT Press, Massachusetts , pp 57–72
Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761
Ding C, He X (2002) Cluster merging and splitting in hierarchical clustering algorithms. Data mining, 2002, ICDM 2003. Proceedings. 2002 IEEE International Conference on, IEEE, pp 139–146, 2002
Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167
Stoffel K, Belkoniene A (1999) Parallel K/h-Means Clustering for Large Data Sets. In: Proceedings of the 5th International Euro-Par Conference on parallel processing. Springer, New York, pp 1451–1454, (Euro-Par ’99)
Acknowledgments
The authors would like to thank the anonymous referees for their comments. This research is partially supported by grants from the National Natural Science Foundation of China (No. 61303167 and No. 61363047) and by funds from the Basic Research Program of Shenzhen (Grant No. JCYJ20130401170306838).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Shang, C., Feng, S., Zhao, Z. et al. Efficiently detecting overlapping communities using seeding and semi-supervised learning. Int. J. Mach. Learn. & Cyber. 8, 455–468 (2017). https://doi.org/10.1007/s13042-015-0338-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13042-015-0338-5