Ontology-assisted discovery of hierarchical topic clusters on the social web

Published: 01 November 2016

Abstract

Discovery and clustering of users by their topic of interest on the Social Web can help enhance various applications, such as user recommendation and expert finding. Traditional approaches, such as latent semantic analysis-based topic modeling or k-means document clustering, run into issues when content is sparse, when the number of existing topics is unknown, and/or when we seek topics that are hierarchical in nature. In this paper, we propose a method for ontology-assisted topic clustering, in which we map Social Web user content to ontological classes to overcome sparsity. Using a novel ranking technique for calculating the topical similarity between individuals at different topic scopes, we construct graphs on which we apply a quasi-clique algorithm in order to find topic clusters at a given scope, without having to pre-define a target number of topics. Our approach allows (1) the topic scope to be controlled in order to discover general or specific topics; (2) the automatic labeling of clusters with tags that are both human- and machine-understandable; and (3) graphs to be clustered recursively in order to generate a hierarchy of topics. The approach is evaluated against ground truths of Twitter users and the 20-newsgroups dataset, which is commonly used in document clustering research. We compare our approach to standard and Twitter-specific latent Dirichlet allocation (LDA), hierarchical LDA, and standard and hierarchical k-means clustering. Results show that our method outperforms regular LDA by up to 24.7%, Twitter-LDA by up to 11.9%, and k-means by up to 26.7% on Social Web content. On a dataset of traditional documents, it performs on par with these approaches, depending on several factors. Additionally, our method can discover the appropriate number and composition of topics at a given topic scope, whereas k-means clustering cannot account for differences in scope.
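The pipeline outlined in the abstract (users mapped to ontology classes, a similarity graph, quasi-clique clustering, cluster labels) can be illustrated with a minimal Python sketch. This is not the authors' method: the Jaccard overlap stands in for the paper's ranking technique, the fixed `threshold` stands in for the topic scope, the greedy density-based growth is only one simple way to find quasi-cliques, and all function names, user names, and class labels are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two sets of ontology class labels (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_graph(user_classes, threshold):
    """Link two users when their class overlap reaches the threshold."""
    graph = {u: set() for u in user_classes}
    for u, v in combinations(user_classes, 2):
        if jaccard(user_classes[u], user_classes[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def density(nodes, graph):
    """Fraction of possible edges that are present among `nodes`."""
    n = len(nodes)
    if n < 2:
        return 1.0
    edges = sum(1 for u, v in combinations(nodes, 2) if v in graph[u])
    return edges / (n * (n - 1) / 2)

def greedy_quasi_cliques(graph, gamma=0.8, min_size=2):
    """Greedily grow node sets whose edge density stays at or above gamma."""
    unassigned = set(graph)
    clusters = []
    while unassigned:
        # Seed with the best-connected remaining node (sorted for determinism).
        seed = max(sorted(unassigned), key=lambda u: len(graph[u] & unassigned))
        cluster = {seed}
        candidates = graph[seed] & unassigned
        while candidates:
            # The candidate with the most links into the cluster gives the
            # highest density, so if it fails, every other candidate fails too.
            best = max(sorted(candidates), key=lambda c: len(graph[c] & cluster))
            if density(cluster | {best}, graph) < gamma:
                break
            cluster.add(best)
            candidates = (candidates | (graph[best] & unassigned)) - cluster
        unassigned -= cluster
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters

def label(cluster, user_classes):
    """Label a cluster with the classes shared by all of its members."""
    return set.intersection(*(user_classes[u] for u in cluster))

# Hypothetical users, each already mapped to illustrative ontology classes.
users = {
    "alice": {"SoccerPlayer", "SoccerClub", "Athlete"},
    "bob": {"SoccerPlayer", "SoccerClub"},
    "carol": {"SoccerPlayer", "SoccerClub", "Athlete"},
    "dave": {"MusicalArtist", "Band"},
    "erin": {"MusicalArtist", "Band", "Album"},
}

graph = build_graph(users, threshold=0.5)
for members in greedy_quasi_cliques(graph, gamma=0.8):
    print(sorted(members), sorted(label(members, users)))
```

In this toy setting the number of clusters is not specified in advance; it falls out of the graph structure. Applying the same two steps again to the members of one cluster, with a stricter threshold, would be one way to mimic the recursive, hierarchical clustering the abstract describes.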

Cited By

  • (2020) Modeling Influence with Semantics in Social Networks. ACM Computing Surveys 53:1 (1-38). DOI: 10.1145/3369780. Online publication date: 6-Feb-2020
  • (2020) From Automatic Keyword Detection to Ontology-Based Topic Modeling. Document Analysis Systems (451-465). DOI: 10.1007/978-3-030-57058-3_32. Online publication date: 26-Jul-2020

Published In

Journal of Web Engineering, Volume 15, Issue 5-6
November 2016
178 pages

Publisher

Rinton Press, Incorporated

Paramus, NJ

Publication History

Published: 01 November 2016
Revised: 22 February 2016
Received: 12 June 2015

Author Tags

  1. community detection
  2. hierarchical clustering
  3. ontology
  4. social web
  5. topic modeling

Qualifiers

  • Article