Abstract
Documents and authors can be clustered into “knowledge communities” based on the overlap in the papers they cite. We introduce a new clustering algorithm, Streemer, which finds cohesive foreground clusters embedded in a diffuse background, and use it to identify knowledge communities as foreground clusters of papers that share citations. To analyze how these communities evolve over time, we build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors. We find that scientific knowledge communities tend to grow more rapidly if their publications build on diverse information and if they use a narrow vocabulary.
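As a rough illustration of what "overlap in the papers they cite" can mean, the sketch below scores pairs of papers by the Jaccard overlap of their reference lists. This is not the Streemer algorithm, and the abstract does not say which similarity measure the authors use; the Jaccard choice, the function name `citation_overlap`, and the toy paper and reference identifiers are all assumptions made for illustration.

```python
from itertools import combinations


def citation_overlap(refs_a: set, refs_b: set) -> float:
    """Jaccard overlap between two reference lists (an assumed measure)."""
    union = refs_a | refs_b
    if not union:
        # Two papers with no references share nothing by convention.
        return 0.0
    return len(refs_a & refs_b) / len(union)


# Hypothetical toy data: paper id -> set of cited work ids.
papers = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r7", "r8"},
}

# Score every unordered pair of papers.
pairwise = {
    (a, b): citation_overlap(papers[a], papers[b])
    for a, b in combinations(sorted(papers), 2)
}

for pair, score in pairwise.items():
    print(pair, score)
```

In this toy data, p1 and p2 share two of four distinct references, while p3 shares none with either. A clustering method that separates foreground from background would presumably group the first two and leave the third unassigned, though how Streemer makes that decision is not described in the abstract.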
Cite this article
Kandylas, V., Upham, S.P. & Ungar, L.H. Finding cohesive clusters for analyzing knowledge communities. Knowl Inf Syst 17, 335–354 (2008). https://doi.org/10.1007/s10115-008-0135-5
DOI: https://doi.org/10.1007/s10115-008-0135-5