DOI: 10.1145/3310986.3310991

Clustering Stability via Concept-based Nonnegative Matrix Factorization

Published: 25 January 2019

Abstract

One of the most important contributions of topic modeling is to accurately and effectively discover and classify documents in a collection of texts into a number of clusters/topics. However, finding an appropriate number of topics is a particularly challenging model selection question. In this context, we introduce a new unsupervised conceptual stability framework to assess the validity of a clustering solution. We integrate the proposed framework into nonnegative matrix factorization (NMF) to guide the selection of the desired number of topics. Our model provides a flexible way to enhance the interpretation of NMF for effective clustering solutions. The work presented in this paper bridges stability-based validation of clustering solutions and NMF in the context of unsupervised learning. We perform a thorough evaluation of our approach over a wide range of real-world datasets and compare it to the current state of the art, namely two NMF-based approaches and four Latent Dirichlet Allocation (LDA) based models. The quantitative experimental results show that integrating such conceptual stability analysis into NMF can lead to significant improvements in document clustering and information retrieval effectiveness.
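To make the idea of stability-based validation concrete, the following is a minimal sketch of how agreement between repeated clustering runs might be scored for a fixed number of topics. It is not the conceptual stability measure proposed in the paper: it swaps in a generic term-overlap measure (Jaccard similarity with a best one-to-one topic matching), and the function names and the example top-term lists are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_a, topics_b):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Uses a brute-force search over permutations, so it is only
    practical for a small number of topics.
    """
    k = len(topics_a)
    if k != len(topics_b):
        raise ValueError("both runs must have the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        total = sum(jaccard(topics_a[i], topics_b[j]) for i, j in enumerate(perm))
        best = max(best, total / k)
    return best


def stability(runs):
    """Average pairwise agreement across several runs with the same k."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure stability")
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(agreement(runs[i], runs[j]) for i, j in pairs) / len(pairs)


# Hypothetical top-term lists from three runs with k = 2 topics.
run1 = [["goal", "team", "match"], ["vote", "party", "election"]]
run2 = [["party", "vote", "election"], ["team", "goal", "league"]]
run3 = [["goal", "team", "match"], ["vote", "party", "poll"]]

score = stability([run1, run2, run3])
print(round(score, 3))
```

Under this kind of scheme, one would compute such a score for each candidate number of topics and prefer values of k whose solutions agree most across runs. For larger k, the brute-force matching could be replaced by the Hungarian method, for example via `scipy.optimize.linear_sum_assignment`.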

Published In

ICMLSC '19: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing
January 2019
268 pages
ISBN: 9781450366120
DOI: 10.1145/3310986
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Concept Discovery
  2. Conceptual Stability
  3. Nonnegative Matrix Factorization
  4. Topic Modeling
  5. Unsupervised Document Clustering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLSC 2019

Bibliometrics

  • Total Citations: 0
  • Total Downloads: 88
  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 12 Nov 2024
