Efficiently detecting overlapping communities using seeding and semi-supervised learning

Changxing Shang^1,2,
Shengzhong Feng¹,
Zhongying Zhao¹ &
…
Jianping Fan¹

547 Accesses
2 Altmetric
Explore all metrics

Abstract

A common scheme for discovering overlapping communities in a network is to use a seeding process followed by an expansion process. Most seeding methods are either too complex to scale to large networks or too simple to select high-quality seeds. Additionally, the non-principled functions used by most expansion methods lead to poor performances when applied to diverse networks. This paper proposes a new method that transforms a network into a corpus. Each edge is treated as a document, and all the network nodes are treated as terms of the corpus. We propose an effective seeding method that selects seeds as a training set, and a principled expansion method based on semi-supervised learning that classifies the edges. We compared our new algorithm with four other community detection algorithms on a wide range of synthetic and empirical networks. Our experimental results show that the new algorithm significantly improved the clustering performance in most cases. Furthermore, the time complexity of the new algorithm is linear with respect to the number of edges, which means that the technique can be scaled to large networks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Semi-supervised Overlapping Community Finding Based on Label Propagation with Pairwise Constraints

Pure expansion-based local community detection

Article 29 July 2024

How to Detect Communities in Large Networks

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Notes

Borgatti [13] proposed the Jaccard similarity matrix, but the Jaccard matrix in this paper is different from Borgatti’s Jaccard similarity matrix. Using the definitions given by Borgatti [13], the Jaccard matrix is a 2-mode matrix derived from a 1-mode adjacent matrix. Borgatti’s Jaccard similarity matrix is a 1-mode matrix derived from a 2-mode women-by-events matrix.
https://github.com/cxshang/ITEM.
All experiments were conducted on a workstation that has eight Intel Xeon 2.27GHz cores and 12G RAM. ITEM and SMC were parallelized running on eight cores using openMP. The other algorithms were not parallelized and ran on one core.
http://snap.stanford.edu/data/index.html.

References

Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821
Article MathSciNet MATH Google Scholar
Borgs C, Chayes J, Mahdian M, Saberi A (2004) Exploring the community structure of newsgroups. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 783–787, 2004
Lee C, Reid F, McDaid A, Hurley N (2010) Detecting highly overlapping community structure by greedy clique expansion. In: SNA-KDD’10: Proceedings of the 4th Workshop on Social Network Mining and Analysis, 2010
Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015
Article Google Scholar
Lee C, Reid F, McDaid A, Hurley N (2011) Seeding for pervasively overlapping communities. Phys Rev E 83(6):066107
Article Google Scholar
Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117
Article Google Scholar
Newman ME, Leicht EA (2007) Mixture models and exploratory analysis in networks. Proc Natl Acad Sci 104(23):9564
Article MATH Google Scholar
Shen H, Cheng X, Cai K, Hu MB (2009) Detect overlapping and hierarchical community structure in networks. Phys A Stat Mech Appl 388(8):1706
Article Google Scholar
Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575
Article MATH Google Scholar
Baumes J, Goldberg MK, Krishnamoorthy MS, Magdon-Ismail M, Preston N (2005) Finding communities by clustering a graph into overlapping subgraphs. IADIS AC 5:97
Google Scholar
Baumes J, Goldberg M, Magdon-Ismail M (2005) Intelligence and security informatics. Springer, New York, pp 27–36
Yang J, Leskovec J (2012) Proceedings of the ACM SIGKDD Workshop on mining data semantics, ACM, p 3, 2012
Borgatti SP (2012) Computational complexity—theory, techniques, and applications. In: Meyers RA (ed). Springer, New York, pp 2912–2924
Berry MW, Castellanos M (2004) Survey of text mining. Springer, New York
Koller D, Sahami M (1997) Proceedings of ICML-97, 14th International Conference on machine learning. Morgan Kaufmann Publishers, Burlington, pp 170–178
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503
Article Google Scholar
Zhu X (2006) Semi-supervised learning literature survey. Comput Sci Univ Wis Madison 2:3
Google Scholar
Jiang J, Yan X, Yu Z, Guo J, Tian W (2014) A Chinese expert disambiguation method based on semi-supervised graph clustering. Int J Mac Learn Cybern :1–8 (2014)
Maulik U, Chakraborty D (2012) A novel semisupervised SVM for pixel classification of remote sensing imagery. Int J Mac Learn Cybern 3(3):247
Article Google Scholar
Chen WJ, Shao YH, Hong N (2014) Laplacian smooth twin support vector machine for semi-supervised classification. Int J Mac Learn Cybern 5(3):459
Article Google Scholar
Tanha J, van Someren M, Afsarmanesh H (2015) Semi-supervised self-training for decision tree classifiers. Int J Mac Learn Cybern :1–16
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103
Article MATH Google Scholar
Shang C, Li M, Feng S, Jiang Q, Fan J (2013) Feature selection via maximizing global information gain for text classification. Knowl Based Syst 54:298
Article Google Scholar
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, ACM, pp 380–388, 2002
Manku GS, Jain A, Das Sarma A (2007) Detecting near-duplicates for web crawling. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 141–150, 2007
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 89–98, 2003
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. ICML 99:258–267
Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
Article Google Scholar
Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118
Article Google Scholar
Traud AL, Kelsic ED, Mucha PJ, Porter MA (2011) Comparing community structure to characteristics in online collegiate social networks. SIAM Rev 53(3):526
Article MathSciNet Google Scholar
Traud AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Phys A Stat Mech Appl 391(16):4165
Article Google Scholar
Lee C, Cunningham P (2014) Community detection: effective evaluation on large social networks. J Comp Netw 2(1):19
Article Google Scholar
Gargi U, Lu W, Mirrokni VS, Yoon S (2011) Large-Scale Community Detection on YouTube for Topic Discovery and Exploration. ICWSM
Subbian K, Aggarwal CC, Srivastava J, Yu PS (2013) Community Detection with Prior Knowledge. In: Proceedings of the 2013 SIAM International Conference on data mining, SIAM, pp 405–413, 2013
Yang T, Jin R, Chi Y, Zhu S (2009) Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 927–936, 2009
Gopalan PK, Blei DM (2013) Efficient discovery of overlapping communities in massive networks. Proc Natl Acad Sci 110(36):14534
Article MathSciNet MATH Google Scholar
Andersen R, Gleich DF, Mirrokni V (2012) Overlapping clusters for distributed computation. In: Proceedings of the fifth ACM international conference on Web search and data mining, ACM, pp 273–282, 2012
Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM , pp 597–605, 2012
Xie J, Kelley S, Szymanski EK (2013) Overlapping Community Detection in Networks: The State-of-the-art and Comparative Study. ACM Comput Surv 45(4):43. doi:10.1145/2501654.2501657
Xie J, Szymanski BK, Liu X (2011) Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, IEEE, pp 344–349, 2011
Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018
Article Google Scholar
Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PloS One 6(4):e18961
Article Google Scholar
Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814
Article Google Scholar
Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. Phys Rev E 84(3):036103
Article Google Scholar
Chapelle O, Schölkopf B, Zien A (2006) Risks of semi-supervisedl earning: how unlabeled data can degrade performance of generative classifiers, in semi-supervised learning. MIT Press, Massachusetts , pp 57–72
Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761
Article Google Scholar
Ding C, He X (2002) Cluster merging and splitting in hierarchical clustering algorithms. Data mining, 2002, ICDM 2003. Proceedings. 2002 IEEE International Conference on, IEEE, pp 139–146, 2002
Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167
Article MathSciNet MATH Google Scholar
Stoffel K, Belkoniene A (1999) Parallel K/h-Means Clustering for Large Data Sets. In: Proceedings of the 5th International Euro-Par Conference on parallel processing. Springer, New York, pp 1451–1454, (Euro-Par ’99)

Download references

Acknowledgments

The authors would like to thank the anonymous referees for their comments. This research is partially supported by grants from the National Natural Science Foundation of China (No. 61303167 and No. 61363047) and by funds from the Basic Research Program of Shenzhen (Grant No. JCYJ20130401170306838).

Author information

Authors and Affiliations

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Changxing Shang, Shengzhong Feng, Zhongying Zhao & Jianping Fan
University of Chinese Academy of Sciences, Beijing, 100080, China
Changxing Shang

Authors

Changxing Shang
View author publications
You can also search for this author in PubMed Google Scholar
Shengzhong Feng
View author publications
You can also search for this author in PubMed Google Scholar
Zhongying Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Jianping Fan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Changxing Shang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shang, C., Feng, S., Zhao, Z. et al. Efficiently detecting overlapping communities using seeding and semi-supervised learning. Int. J. Mach. Learn. & Cyber. 8, 455–468 (2017). https://doi.org/10.1007/s13042-015-0338-5

Download citation

Received: 28 January 2015
Accepted: 06 February 2015
Published: 22 February 2015
Issue Date: April 2017
DOI: https://doi.org/10.1007/s13042-015-0338-5

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Semi-supervised Overlapping Community Finding Based on Label Propagation with Pairwise Constraints

Pure expansion-based local community detection

How to Detect Communities in Large Networks

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Efficiently detecting overlapping communities using seeding and semi-supervised learning

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Semi-supervised Overlapping Community Finding Based on Label Propagation with Pairwise Constraints

Pure expansion-based local community detection

How to Detect Communities in Large Networks

Explore related subjects

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation