Ontology-assisted discovery of hierarchical topic clusters on the social web

Authors:

Kristian Slabbekoorn,

Takehiro Tokuda

Journal of Web Engineering, Volume 15, Issue 5-6

Pages 361 - 396

Published: 01 November 2016

Abstract

Discovery and clustering of users by their topic of interest on the Social Web can help enhance various applications, such as user recommendation and expert finding. Traditional approaches, such as latent semantic analysis-based topic modeling or k-means document clustering, run into issues when content is sparse, when the number of existing topics is unknown, or when the topics sought are hierarchical in nature. In this paper, we propose a method for ontology-assisted topic clustering, in which we map Social Web user content to ontological classes to overcome sparsity. Using a novel ranking technique for calculating the topical similarity between individuals at different topic scopes, we construct graphs on which we apply a quasi-clique algorithm in order to find topic clusters at that scope, without having to pre-define a target number of topics. Our approach allows (1) the topic scope to be controlled in order to discover general or specific topics; (2) the automatic labeling of clusters with tags that are human- and machine-understandable; and (3) graphs to be clustered recursively in order to generate a hierarchy of topics. The approach is evaluated against ground truths of Twitter users and the 20-newsgroups dataset, commonly used in document clustering research. We compare our approach to standard and Twitter-specific latent Dirichlet allocation (LDA), hierarchical LDA, and standard and hierarchical k-means clustering. Results show that our method outperforms regular LDA by up to 24.7%, Twitter-LDA by up to 11.9%, and k-means by up to 26.7% on Social Web content. On a dataset of traditional documents, it performs comparably to these approaches, depending on several factors. Additionally, our method can discover the appropriate number and composition of topics at a given topic scope, whereas k-means clustering cannot account for differences in scope.
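The pipeline outlined in the abstract (ontology-class profiles, a similarity graph, quasi-clique clustering) can be illustrated with a minimal sketch. It is not the authors' method: cosine similarity stands in for the paper's ranking technique, the greedy expansion is one simple way to find degree-based quasi-cliques, and the user names, class labels, `threshold`, and `gamma` are all hypothetical values chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(profiles, threshold):
    """Connect two users when their profile similarity reaches the threshold."""
    graph = {user: set() for user in profiles}
    for u, v in combinations(profiles, 2):
        if cosine(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def is_quasi_clique(graph, nodes, gamma):
    """Degree-based test: each member is adjacent to >= gamma * (|S| - 1) members."""
    needed = gamma * (len(nodes) - 1)
    return all(len(graph[n] & nodes) >= needed for n in nodes)


def greedy_quasi_cliques(graph, gamma=0.8, min_size=2):
    """Grow clusters from high-degree seeds while the quasi-clique test holds."""
    unassigned = set(graph)
    clusters = []
    for seed in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        if seed not in unassigned:
            continue
        cluster = {seed}
        grown = True
        while grown:
            grown = False
            # Prefer candidates with the most links into the current cluster.
            candidates = sorted(
                (n for n in unassigned - cluster if graph[n] & cluster),
                key=lambda n: (-len(graph[n] & cluster), n),
            )
            for cand in candidates:
                if is_quasi_clique(graph, cluster | {cand}, gamma):
                    cluster.add(cand)
                    grown = True
                    break
        if len(cluster) >= min_size:
            clusters.append(cluster)
            unassigned -= cluster
    return clusters


# Hypothetical users, each described by counts of ontology classes
# that their content was mapped to (illustrative labels only).
profiles = {
    "alice": {"Athlete": 3, "SportsTeam": 2},
    "bob": {"Athlete": 2, "SportsTeam": 3},
    "carol": {"Athlete": 1, "SportsTeam": 1},
    "dave": {"MusicalArtist": 4, "Album": 1},
    "erin": {"MusicalArtist": 2, "Album": 2},
}

graph = similarity_graph(profiles, threshold=0.5)
clusters = greedy_quasi_cliques(graph, gamma=0.8)
print([sorted(c) for c in clusters])
# [['alice', 'bob', 'carol'], ['dave', 'erin']]
```

Note that the number of clusters is not passed in anywhere; it falls out of the graph structure, which is the property the abstract contrasts with k-means. Raising `threshold` in this sketch loosely mimics narrowing the topic scope, since fewer pairs of users remain connected.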

Cited By

Razis G, Anagnostopoulos I, Zeadally S (2020). Modeling Influence with Semantics in Social Networks. ACM Computing Surveys 53:1 (1-38). DOI: 10.1145/3369780. Online publication date: 6-Feb-2020.
https://dl.acm.org/doi/10.1145/3369780
Beck M, Rizvi S, Dengel A, Ahmed S (2020). From Automatic Keyword Detection to Ontology-Based Topic Modeling. Document Analysis Systems (451-465). DOI: 10.1007/978-3-030-57058-3_32. Online publication date: 26-Jul-2020.
https://dl.acm.org/doi/10.1007/978-3-030-57058-3_32

Published In

Journal of Web Engineering Volume 15, Issue 5-6

November 2016

178 pages

ISSN: 1540-9589

Publisher

Rinton Press, Incorporated

Paramus, NJ

Publication History

Published: 01 November 2016

Revised: 22 February 2016

Received: 12 June 2015
