Abstract
The problem of analysing dynamically evolving textual data has arisen within the last few years. An example of such data is the discussion taking place on Internet chat lines. In this Letter, a recently introduced source separation method, termed complexity pursuit, is applied to the problem of finding topics in dynamical text and is compared against several blind separation algorithms on the same problem. Complexity pursuit is a generalisation of projection pursuit to time series, and it can use both higher-order statistical measures and temporal dependency information in separating the topics. Experimental results on chat line and newsgroup data demonstrate that the minimum-complexity time series do indeed correspond to meaningful topics inherent in the dynamical text data, and they also suggest that the method is applicable to query-based retrieval from a temporally changing text stream.
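The role of temporal dependency can be illustrated with a small sketch. It is not the algorithm evaluated in this Letter: it keeps only the second-order, temporal part of the idea and leaves out the higher-order statistics. It assumes a first-order autoregressive predictor and uses the log-variance of the prediction residual as a rough proxy for coding complexity. The function name `ar1_residual_log_variance` and the synthetic signals are hypothetical and chosen purely for illustration.

```python
import numpy as np


def ar1_residual_log_variance(y):
    """Rough complexity proxy: log-variance of the AR(1) prediction residual.

    Assumption for illustration only: a lower value means the series is
    easier to predict from its own past, i.e. 'less complex'.
    """
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()          # standardise to zero mean, unit variance
    prev, curr = y[:-1], y[1:]
    alpha = prev @ curr / (prev @ prev)   # least-squares AR(1) coefficient
    residual = curr - alpha * prev
    return float(np.log(residual.var()))


rng = np.random.default_rng(0)
n = 5000

# A temporally structured "source": AR(1) process with coefficient 0.9.
driving_noise = rng.standard_normal(n)
smooth = np.zeros(n)
for t in range(1, n):
    smooth[t] = 0.9 * smooth[t - 1] + driving_noise[t]

# An unstructured source and an equal-weight mixture of the two.
unstructured = rng.standard_normal(n)
mixture = (smooth / smooth.std() + unstructured) / np.sqrt(2)

smooth_score = ar1_residual_log_variance(smooth)
mixture_score = ar1_residual_log_variance(mixture)
unstructured_score = ar1_residual_log_variance(unstructured)

print("structured source:", smooth_score)
print("mixture:          ", mixture_score)
print("white noise:      ", unstructured_score)
```

Under these assumptions the structured source scores lower than the mixture, which is the intuition behind searching for minimum-complexity projections: mixing a predictable series with anything else makes it harder to predict.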
Cite this article
Bingham, E., Kabán, A. & Girolami, M. Topic Identification in Dynamical Text by Complexity Pursuit. Neural Processing Letters 17, 69–83 (2003). https://doi.org/10.1023/A:1022990829563
DOI: https://doi.org/10.1023/A:1022990829563