Nothing Special   »   [go: up one dir, main page]

Skip to main content

Advertisement

Log in

Efficiently detecting overlapping communities using seeding and semi-supervised learning

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Abstract

A common scheme for discovering overlapping communities in a network is to use a seeding process followed by an expansion process. Most seeding methods are either too complex to scale to large networks or too simple to select high-quality seeds. Additionally, the non-principled functions used by most expansion methods lead to poor performances when applied to diverse networks. This paper proposes a new method that transforms a network into a corpus. Each edge is treated as a document, and all the network nodes are treated as terms of the corpus. We propose an effective seeding method that selects seeds as a training set, and a principled expansion method based on semi-supervised learning that classifies the edges. We compared our new algorithm with four other community detection algorithms on a wide range of synthetic and empirical networks. Our experimental results show that the new algorithm significantly improved the clustering performance in most cases. Furthermore, the time complexity of the new algorithm is linear with respect to the number of edges, which means that the technique can be scaled to large networks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Notes

  1. Borgatti [13] proposed the Jaccard similarity matrix, but the Jaccard matrix in this paper is different from Borgatti’s Jaccard similarity matrix. Using the definitions given by Borgatti [13], the Jaccard matrix is a 2-mode matrix derived from a 1-mode adjacent matrix. Borgatti’s Jaccard similarity matrix is a 1-mode matrix derived from a 2-mode women-by-events matrix.

  2. https://github.com/cxshang/ITEM.

  3. All experiments were conducted on a workstation that has eight Intel Xeon 2.27GHz cores and 12G RAM. ITEM and SMC were parallelized running on eight cores using openMP. The other algorithms were not parallelized and ran on one core.

  4. http://snap.stanford.edu/data/index.html.

References

  1. Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821

    Article  MathSciNet  MATH  Google Scholar 

  2. Borgs C, Chayes J, Mahdian M, Saberi A (2004) Exploring the community structure of newsgroups. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 783–787, 2004

  3. Lee C, Reid F, McDaid A, Hurley N (2010) Detecting highly overlapping community structure by greedy clique expansion. In: SNA-KDD’10: Proceedings of the 4th Workshop on Social Network Mining and Analysis, 2010

  4. Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015

    Article  Google Scholar 

  5. Lee C, Reid F, McDaid A, Hurley N (2011) Seeding for pervasively overlapping communities. Phys Rev E 83(6):066107

    Article  Google Scholar 

  6. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117

    Article  Google Scholar 

  7. Newman ME, Leicht EA (2007) Mixture models and exploratory analysis in networks. Proc Natl Acad Sci 104(23):9564

    Article  MATH  Google Scholar 

  8. Shen H, Cheng X, Cai K, Hu MB (2009) Detect overlapping and hierarchical community structure in networks. Phys A Stat Mech Appl 388(8):1706

    Article  Google Scholar 

  9. Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575

    Article  MATH  Google Scholar 

  10. Baumes J, Goldberg MK, Krishnamoorthy MS, Magdon-Ismail M, Preston N (2005) Finding communities by clustering a graph into overlapping subgraphs. IADIS AC 5:97

    Google Scholar 

  11. Baumes J, Goldberg M, Magdon-Ismail M (2005) Intelligence and security informatics. Springer, New York, pp 27–36

  12. Yang J, Leskovec J (2012) Proceedings of the ACM SIGKDD Workshop on mining data semantics, ACM, p 3, 2012

  13. Borgatti SP (2012) Computational complexity—theory, techniques, and applications. In: Meyers RA (ed). Springer, New York, pp 2912–2924

  14. Berry MW, Castellanos M (2004) Survey of text mining. Springer, New York

  15. Koller D, Sahami M (1997) Proceedings of ICML-97, 14th International Conference on machine learning. Morgan Kaufmann Publishers, Burlington, pp 170–178

  16. Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503

    Article  Google Scholar 

  17. Zhu X (2006) Semi-supervised learning literature survey. Comput Sci Univ Wis Madison 2:3

    Google Scholar 

  18. Jiang J, Yan X, Yu Z, Guo J, Tian W (2014) A Chinese expert disambiguation method based on semi-supervised graph clustering. Int J Mac Learn Cybern :1–8 (2014)

  19. Maulik U, Chakraborty D (2012) A novel semisupervised SVM for pixel classification of remote sensing imagery. Int J Mac Learn Cybern 3(3):247

    Article  Google Scholar 

  20. Chen WJ, Shao YH, Hong N (2014) Laplacian smooth twin support vector machine for semi-supervised classification. Int J Mac Learn Cybern 5(3):459

    Article  Google Scholar 

  21. Tanha J, van Someren M, Afsarmanesh H (2015) Semi-supervised self-training for decision tree classifiers. Int J Mac Learn Cybern :1–16

  22. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103

    Article  MATH  Google Scholar 

  23. Shang C, Li M, Feng S, Jiang Q, Fan J (2013) Feature selection via maximizing global information gain for text classification. Knowl Based Syst 54:298

    Article  Google Scholar 

  24. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, ACM, pp 380–388, 2002

  25. Manku GS, Jain A, Das Sarma A (2007) Detecting near-duplicates for web crawling. In: Proceedings of the 16th international conference on World Wide Web, ACM, pp 141–150, 2007

  26. Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 89–98, 2003

  27. Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. ICML 99:258–267

  28. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110

    Article  Google Scholar 

  29. Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118

    Article  Google Scholar 

  30. Traud AL, Kelsic ED, Mucha PJ, Porter MA (2011) Comparing community structure to characteristics in online collegiate social networks. SIAM Rev 53(3):526

    Article  MathSciNet  Google Scholar 

  31. Traud AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Phys A Stat Mech Appl 391(16):4165

    Article  Google Scholar 

  32. Lee C, Cunningham P (2014) Community detection: effective evaluation on large social networks. J Comp Netw 2(1):19

    Article  Google Scholar 

  33. Gargi U, Lu W, Mirrokni VS, Yoon S (2011) Large-Scale Community Detection on YouTube for Topic Discovery and Exploration. ICWSM

  34. Subbian K, Aggarwal CC, Srivastava J, Yu PS (2013) Community Detection with Prior Knowledge. In: Proceedings of the 2013 SIAM International Conference on data mining, SIAM, pp 405–413, 2013

  35. Yang T, Jin R, Chi Y, Zhu S (2009) Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 927–936, 2009

  36. Gopalan PK, Blei DM (2013) Efficient discovery of overlapping communities in massive networks. Proc Natl Acad Sci 110(36):14534

    Article  MathSciNet  MATH  Google Scholar 

  37. Andersen R, Gleich DF, Mirrokni V (2012) Overlapping clusters for distributed computation. In: Proceedings of the fifth ACM international conference on Web search and data mining, ACM, pp 273–282, 2012

  38. Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, ACM , pp 597–605, 2012

  39. Xie J, Kelley S, Szymanski EK (2013) Overlapping Community Detection in Networks: The State-of-the-art and Comparative Study. ACM Comput Surv 45(4):43. doi:10.1145/2501654.2501657

  40. Xie J, Szymanski BK, Liu X (2011) Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, IEEE, pp 344–349, 2011

  41. Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018

    Article  Google Scholar 

  42. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PloS One 6(4):e18961

    Article  Google Scholar 

  43. Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814

    Article  Google Scholar 

  44. Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. Phys Rev E 84(3):036103

    Article  Google Scholar 

  45. Chapelle O, Schölkopf B, Zien A (2006) Risks of semi-supervisedl earning: how unlabeled data can degrade performance of generative classifiers, in semi-supervised learning. MIT Press, Massachusetts , pp 57–72

  46. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761

    Article  Google Scholar 

  47. Ding C, He X (2002) Cluster merging and splitting in hierarchical clustering algorithms. Data mining, 2002, ICDM 2003. Proceedings. 2002 IEEE International Conference on, IEEE, pp 139–146, 2002

  48. Newman ME (2003) The structure and function of complex networks. SIAM Rev 45(2):167

    Article  MathSciNet  MATH  Google Scholar 

  49. Stoffel K, Belkoniene A (1999) Parallel K/h-Means Clustering for Large Data Sets. In: Proceedings of the 5th International Euro-Par Conference on parallel processing. Springer, New York, pp 1451–1454, (Euro-Par ’99)

Download references

Acknowledgments

The authors would like to thank the anonymous referees for their comments. This research is partially supported by grants from the National Natural Science Foundation of China (No. 61303167 and No. 61363047) and by funds from the Basic Research Program of Shenzhen (Grant No. JCYJ20130401170306838).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Changxing Shang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shang, C., Feng, S., Zhao, Z. et al. Efficiently detecting overlapping communities using seeding and semi-supervised learning. Int. J. Mach. Learn. & Cyber. 8, 455–468 (2017). https://doi.org/10.1007/s13042-015-0338-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13042-015-0338-5

Keywords

Navigation