Topic Detection by Topic Model Induced Distance Using Biased Initiation

Yonghui Wu, Yuxin Ding, Xiaolong Wang & Jun Xu

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6059)

Abstract

Clustering is widely used in topic detection tasks. However, distances based on the vector space (VS) model, such as the cosine-like distance, yield low precision and recall when the corpus contains many related topics. In this paper, we propose a new distance measure: the Topic Model (TM) induced distance. Assuming that the word distribution differs from topic to topic, the documents can be treated as samples from a mixture of k topic models, which can be estimated using expectation maximization (EM). We propose a biased initiation method for topic decomposition using EM, which produces a converged matrix from which the TM induced distance is generated. Collections of web news are then clustered into classes using this TM distance. A series of experiments is described on a corpus of 5033 web news articles from 30 topics. K-means clustering is run on test sets with different numbers of topics, and the clustering results obtained with the TM induced distance are compared with those obtained with the traditional cosine-like distance. The experimental results show that the proposed topic decomposition method using biased initiation is more effective than topic decomposition initialized with random values. The TM induced distance generates more topical groups than the VS model based cosine-like distance. On web news collections containing related topics, the TM induced distance achieves better precision and recall.
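
The abstract does not give the exact model, the initiation scheme, or the distance formula, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a mixture of k unigram (multinomial) topic models fitted with EM, uses one seed document per topic as a stand-in for biased initiation, and takes the Euclidean distance between posterior topic vectors as a stand-in for the TM induced distance. The names `biased_init`, `em_mixture_of_unigrams`, and `tm_distance` are hypothetical.

```python
import math


def biased_init(docs, seed_ids, vocab_size, smooth=1.0):
    """Assumed biased initiation: start each topic's p(word | topic) from the
    smoothed word counts of one seed document instead of random values."""
    rows = []
    for d in seed_ids:
        counts = [smooth] * vocab_size
        for w, c in docs[d].items():
            counts[w] += c
        norm = sum(counts)
        rows.append([c / norm for c in counts])
    return rows


def em_mixture_of_unigrams(docs, init_word_probs, n_iter=50, smooth=1e-2):
    """Fit a mixture of k unigram topic models with EM.

    docs: list of {word_index: count} dicts.
    init_word_probs: k rows of strictly positive p(word | topic).
    Returns (posteriors, word_probs, priors) with
    posteriors[d][t] = p(topic t | document d).
    """
    k, vocab_size = len(init_word_probs), len(init_word_probs[0])
    word_probs = [row[:] for row in init_word_probs]
    priors = [1.0 / k] * k
    posteriors = []
    for _ in range(n_iter):
        # E-step: topic responsibilities per document, computed in log space.
        posteriors = []
        for doc in docs:
            logs = [
                math.log(priors[t])
                + sum(c * math.log(word_probs[t][w]) for w, c in doc.items())
                for t in range(k)
            ]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            posteriors.append([v / total for v in weights])
        # M-step: re-estimate word distributions and topic priors (smoothed).
        for t in range(k):
            counts = [smooth] * vocab_size
            for doc, post in zip(docs, posteriors):
                for w, c in doc.items():
                    counts[w] += post[t] * c
            norm = sum(counts)
            word_probs[t] = [c / norm for c in counts]
            priors[t] = (sum(p[t] for p in posteriors) + smooth) / (
                len(docs) + k * smooth
            )
    return posteriors, word_probs, priors


def tm_distance(p, q):
    """Assumed TM induced distance: Euclidean distance between the posterior
    topic vectors of two documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy corpus (made up for illustration): documents 0-1 use words 0-1,
# documents 2-3 use words 2-3.
docs = [{0: 3, 1: 2}, {0: 2, 1: 3}, {2: 3, 3: 2}, {2: 2, 3: 3}]
init = biased_init(docs, seed_ids=[0, 2], vocab_size=4)
post, word_probs, priors = em_mixture_of_unigrams(docs, init)
same_topic = tm_distance(post[0], post[1])
cross_topic = tm_distance(post[0], post[2])
```

In this toy run, `same_topic` comes out smaller than `cross_topic`, which is the property a clustering step such as K-means would rely on. A cosine-like distance on raw word counts would be computed from `docs` directly, whereas here the distance is computed from the converged posterior matrix `post`.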

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, People’s Republic of China
Yonghui Wu & Xiaolong Wang
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People’s Republic of China
Yonghui Wu, Yuxin Ding, Xiaolong Wang & Jun Xu

Editor information

Editors and Affiliations

Hannam University, 133 Ojeong-dong, daeduk-gu, 306-791, Daejeon, South Korea
Tai-hoon Kim
The Ohio State University, 43210, Columbus, OH, USA
Hojjat Adeli

Cite this paper

Wu, Y., Ding, Y., Wang, X., Xu, J. (2010). Topic Detection by Topic Model Induced Distance Using Biased Initiation. In: Kim, Th., Adeli, H. (eds) Advances in Computer Science and Information Technology. AST ACN 2010. Lecture Notes in Computer Science, vol 6059. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13577-4_27

DOI: https://doi.org/10.1007/978-3-642-13577-4_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13576-7
Online ISBN: 978-3-642-13577-4
eBook Packages: Computer Science, Computer Science (R0)