Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2488388.2488483acmotherconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Efficient community detection in large networks using content and links

Published: 13 May 2013 Publication History

Abstract

In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks.
Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient.
We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering.
Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.

References

[1]
C. Aggarwal, Y. Zhao, and P. Yu. Outlier detection in graph streams. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 399--409. IEEE, 2011.
[2]
D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3:993--1022, 2003.
[3]
A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC?98, pages 327--336. ACM, 1998.
[4]
M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC'02, pages 380--388. ACM, 2002.
[5]
D. Cohn and T. Hofmann. The missing link-a probabilistic model of document content and hypertext connectivity. In NIPS'01, volume 13, page 430. The MIT Press, 2001.
[6]
I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In SIGKDD'04, pages 551--556. ACM, 2004.
[7]
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 101(Suppl 1):5220, 2004.
[8]
M. Ester, R. Ge, B. Gao, Z. Hu, and B. Ben-Moshe. Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In SDM'06, pages 25--46, 2006.
[9]
S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75--174, 2010.
[10]
E. Fox and J. Shaw. Combination of multiple searches. NIST SPECIAL PUBLICATION SP, pages 243--243, 1994.
[11]
T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228, 2004.
[12]
S. Günnemann, B. Boden, and T. Seidl. Db-csc: a density-based approach for subspace clustering in graphs with feature vectors. In PKDD 2011, pages 565--580. Springer-Verlag, 2011.
[13]
D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(suppl 1):S145--S154, 2002.
[14]
D. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383--413, 1999.
[15]
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359--392, 1998.
[16]
J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web, pages 631--640. ACM, 2010.
[17]
A. Maiya and T. Berger-Wolf. Sampling community structure. In WWW 2010, pages 701--710. ACM, 2010.
[18]
B. Mohar. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2:871--898, 1991.
[19]
M. Montague and J. Aslam. Relevance score normalization for metasearch. In CIKM'01, pages 427--433. ACM, 2001.
[20]
R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In SIGKDD'08, pages 542--550. ACM, 2008.
[21]
V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In SIGKDD'09, pages 737--746. ACM, 2009.
[22]
V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In SIGMOD 2011, pages 721--732. ACM, 2011.
[23]
A. Strehl and J. Ghosh. Cluster ensembles?a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583--617, 2003.
[24]
W. Tang, Z. Lu, and I. Dhillon. Clustering with multiple graphs. In ICDM'09, pages 1016--1021. IEEE, 2009.
[25]
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In SIGMOD'08, pages 567--580, 2008.
[26]
I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC Systems Biology, 1(1):8, 2007.
[27]
S. van Dongen. Graph clustering by flow simulation. PhD Thesis, 2000.
[28]
T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD'09, pages 927--936. ACM, 2009.
[29]
D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW'06, pages 173--182. ACM, 2006.
[30]
Y. Zhou, H. Cheng, and J. Yu. Clustering large attributed graphs: An efficient incremental approach. In ICDM 2010, pages 689--698. IEEE, 2010.
[31]
S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR'07, pages 487--494. ACM, 2007.

Cited By

View all
  • (2024)Mixbiotic society measures: Assessment of community well-going as living systemPLOS ONE10.1371/journal.pone.030740119:8(e0307401)Online publication date: 7-Aug-2024
  • (2024)SDAC-DA: Semi-Supervised Deep Attributed Clustering Using Dual AutoencoderIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.338904936:11(6989-7002)Online publication date: Nov-2024
  • (2024)Mapping the Evolutionary Pattern of Mobile Products: A Phylogenetic ApproachIEEE Transactions on Engineering Management10.1109/TEM.2022.321448971(4776-4790)Online publication date: 2024
  • Show More Cited By

Index Terms

  1. Efficient community detection in large networks using content and links

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    WWW '13: Proceedings of the 22nd international conference on World Wide Web
    May 2013
    1628 pages
    ISBN:9781450320351
    DOI:10.1145/2488388

    Sponsors

    • NICBR: Nucleo de Informatcao e Coordenacao do Ponto BR
    • CGIBR: Comite Gestor da Internet no Brazil

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 May 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. content analysis
    2. graph clustering
    3. web mining

    Qualifiers

    • Research-article

    Conference

    WWW '13
    Sponsor:
    • NICBR
    • CGIBR
    WWW '13: 22nd International World Wide Web Conference
    May 13 - 17, 2013
    Rio de Janeiro, Brazil

    Acceptance Rates

    WWW '13 Paper Acceptance Rate 125 of 831 submissions, 15%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)69
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 16 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Mixbiotic society measures: Assessment of community well-going as living systemPLOS ONE10.1371/journal.pone.030740119:8(e0307401)Online publication date: 7-Aug-2024
    • (2024)SDAC-DA: Semi-Supervised Deep Attributed Clustering Using Dual AutoencoderIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.338904936:11(6989-7002)Online publication date: Nov-2024
    • (2024)Mapping the Evolutionary Pattern of Mobile Products: A Phylogenetic ApproachIEEE Transactions on Engineering Management10.1109/TEM.2022.321448971(4776-4790)Online publication date: 2024
    • (2024)Efficient Multi-network Community Search Method for Distributed Graph Iterative Computation2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)10.1109/COMPSAC61105.2024.00050(306-311)Online publication date: 2-Jul-2024
    • (2024)Diverse joint nonnegative matrix tri-factorization for attributed graph clusteringApplied Soft Computing10.1016/j.asoc.2024.112012164(112012)Online publication date: Oct-2024
    • (2024)Evolutionary multi-objective overlapping community detection based on fusion of internal and external connectivity and correction of node intimacyApplied Soft Computing10.1016/j.asoc.2024.111414(111414)Online publication date: Feb-2024
    • (2024)Attribute subspace-guided multi-scale community detectionNeural Computing and Applications10.1007/s00521-024-09751-636:22(13975-13988)Online publication date: 2-May-2024
    • (2024)A multi-objective optimization approach for overlapping dynamic community detectionSoft Computing10.1007/s00500-024-09895-628:19(11323-11342)Online publication date: 24-Jul-2024
    • (2024)BackgroundFinding Communities in Social Networks Using Graph Embeddings10.1007/978-3-031-60916-9_2(17-36)Online publication date: 29-Apr-2024
    • (2023)Core-periphery structure in networks: A statistical expositionStatistics Surveys10.1214/23-SS14117:noneOnline publication date: 1-Jan-2023
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media