Topic Detection by Topic Model Induced Distance Using Biased Initiation

  • Conference paper
Advances in Computer Science and Information Technology (AST 2010, ACN 2010)

Abstract

Clustering is widely used in topic detection tasks. However, distances based on the vector space (VS) model, such as the cosine-like distance, yield low precision and recall when the corpus contains many related topics. In this paper, we propose a new distance measure: the Topic Model (TM) induced distance. Assuming that the word distribution differs from topic to topic, the documents can be treated as a sample from a mixture of k topic models, which can be estimated using expectation maximization (EM). We propose a biased initiation method for topic decomposition using EM, which produces a converged matrix from which the TM induced distance is derived. Collections of web news are clustered into classes using this TM distance. A series of experiments is described on a corpus containing 5033 web news documents from 30 topics. K-means clustering is run on test sets with different numbers of topics, and the clustering results obtained with the TM induced distance are compared with those obtained with the traditional cosine-like distance. The experimental results show that the proposed topic decomposition method using biased initiation is more effective than topic decomposition using random initial values. The TM induced distance generates more topical groups than the VS model based cosine-like distance. In web news collections containing related topics, the TM induced distance achieves better precision and recall.
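
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a mixture of k multinomial (unigram) topic models fitted by EM, takes the matrix of per-document topic posteriors as the converged matrix, and uses the Euclidean distance between rows of that matrix as a stand-in for the TM induced distance. The helper names `biased_init`, `em_mixture`, and `tm_distance`, the seed-document initialisation, the smoothing constants, and the toy counts are all hypothetical.

```python
import numpy as np

def biased_init(counts, seed_docs, smoothing=1.0):
    """Hypothetical biased start (an assumption, not the paper's scheme):
    topic t begins at the smoothed word distribution of document seed_docs[t]."""
    probs = np.asarray(counts, dtype=float)[seed_docs] + smoothing
    return probs / probs.sum(axis=1, keepdims=True)

def em_mixture(counts, init_word_probs, n_iter=50):
    """EM for a mixture of k multinomial topic models.

    counts: (n_docs, n_words) term-count matrix.
    init_word_probs: (k, n_words) initial P(word | topic), rows summing to 1.
    Returns (posterior, word_probs, priors); posterior[d, t] = P(topic t | doc d).
    """
    counts = np.asarray(counts, dtype=float)
    word_probs = np.asarray(init_word_probs, dtype=float).copy()
    k = word_probs.shape[0]
    priors = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log P(doc, topic), normalised per document in log space.
        log_joint = counts @ np.log(word_probs).T + np.log(priors)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        posterior = np.exp(log_joint)
        posterior /= posterior.sum(axis=1, keepdims=True)
        # M-step: re-estimate topic priors and smoothed word distributions.
        priors = np.clip(posterior.mean(axis=0), 1e-12, None)
        priors /= priors.sum()
        word_probs = posterior.T @ counts + 1e-2
        word_probs /= word_probs.sum(axis=1, keepdims=True)
    return posterior, word_probs, priors

def tm_distance(posterior):
    """Pairwise Euclidean distance between documents' topic posteriors
    (an assumed stand-in for the TM induced distance)."""
    diff = posterior[:, None, :] - posterior[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy corpus: documents 0-1 favour words 0-1, documents 2-3 favour words 2-3.
counts = np.array([
    [5, 3, 0, 0],
    [4, 4, 0, 1],
    [0, 1, 5, 4],
    [0, 0, 3, 6],
])
init = biased_init(counts, seed_docs=[0, 2])
posterior, word_probs, priors = em_mixture(counts, init)
dist = tm_distance(posterior)
print(posterior.round(3))
print(dist.round(3))
```

In this sketch the resulting distance matrix, or the posterior rows themselves, could then be passed to a k-means style clustering step. Seeding each topic from a document rather than from random values is one simple way to bias the start of EM; how the paper actually constructs its biased initiation is not stated in the abstract.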

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wu, Y., Ding, Y., Wang, X., Xu, J. (2010). Topic Detection by Topic Model Induced Distance Using Biased Initiation. In: Kim, Th., Adeli, H. (eds) Advances in Computer Science and Information Technology. AST 2010, ACN 2010. Lecture Notes in Computer Science, vol 6059. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13577-4_27

  • DOI: https://doi.org/10.1007/978-3-642-13577-4_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13576-7

  • Online ISBN: 978-3-642-13577-4

  • eBook Packages: Computer Science, Computer Science (R0)
