Abstract
The task of text clustering is to partition a set of documents into meaningful groups such that the documents within a cluster are more similar to each other than to those of other clusters, according to a similarity or dissimilarity measure. The similarity measure therefore plays a crucial role in producing good-quality clusters. The content similarity between two documents is generally used to form individual clusters, and it is measured by considering the terms shared between the documents. However, this may not be effective for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering using a spectral method. The proposed measure assigns a score to a pair of documents based on their content similarity and their individual similarity with the neighbours they share over the corpus. Its effectiveness has been tested by clustering several standard corpora with the spectral clustering method. The empirical results on some well-known text collections show that the proposed method performs better than state-of-the-art text clustering techniques in terms of normalized mutual information, f-measure and v-measure.
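The idea of combining direct content similarity with evidence from shared neighbours can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's postimpact formula: the blending weight `alpha`, the neighbourhood size `k`, and the function `neighbour_aware_similarity` are all assumptions introduced here for exposition.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def neighbour_aware_similarity(docs, k=2, alpha=0.5):
    """Blend direct cosine similarity between two documents with the
    average similarity of both documents to their shared k nearest
    neighbours (a sketch of the shared-neighbour idea, not the paper's
    exact measure)."""
    X = TfidfVectorizer().fit_transform(docs)
    S = cosine_similarity(X)
    n = S.shape[0]
    # k nearest neighbours of each document, excluding the document itself
    nbrs = [set(np.argsort(-S[i])[1:k + 1]) for i in range(n)]
    out = np.zeros_like(S)
    for i in range(n):
        for j in range(n):
            shared = nbrs[i] & nbrs[j]
            # mean similarity of i and j to the neighbours they share
            nb = np.mean([(S[i, m] + S[j, m]) / 2 for m in shared]) if shared else 0.0
            out[i, j] = alpha * S[i, j] + (1 - alpha) * nb
    return out
```

Because the cosine matrix and the shared-neighbour sets are both symmetric, the resulting score matrix is symmetric and can be fed directly to a spectral clustering routine as a precomputed affinity.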
Additional information
A portion of this work was done while the authors were affiliated with Ramakrishna Mission Vivekananda Educational and Research Institute, Belur Math, West Bengal, India.
Appendix A: additional results
Table 9 reports, in terms of NMI, the performance of the spectral clustering technique using the postimpact similarity measure and using several standard linear distance functions, for all the corpora described in Sect. 4.1. These distance functions are defined in Table 8 along with their ranges of values. Table 9 shows that the postimpact similarity measure outperforms the other distance functions for text clustering with the spectral method. The t-test described in Sect. 4.4 has been performed to check the statistical significance of the results in Table 9, and the improvements of spectral clustering with postimpact similarity over the same method with the other distance functions are significant in every case. These results demonstrate the effectiveness of the postimpact similarity for text clustering.
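The experimental pipeline behind Table 9, namely spectral clustering over a precomputed document affinity matrix evaluated by NMI, can be sketched with scikit-learn. This is a toy reconstruction under assumptions: the six example documents and their labels are invented, and plain cosine similarity stands in for the postimpact measure and the compared distance functions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus with two topical groups (stand-in for the standard corpora)
docs = ["apple banana fruit", "banana apple juice", "fruit salad apple",
        "car engine wheel", "engine oil car", "wheel brake car"]
labels_true = [0, 0, 0, 1, 1, 1]

X = TfidfVectorizer().fit_transform(docs)
# Any pairwise similarity can be plugged in here as the affinity;
# the small offset keeps the similarity graph connected.
A = cosine_similarity(X) + 1e-3

sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels_pred = sc.fit_predict(A)

# External evaluation, as in Table 9
nmi = normalized_mutual_info_score(labels_true, labels_pred)
print(f"NMI: {nmi:.3f}")
```

Swapping a different similarity or distance-derived affinity into `A` while keeping the rest of the pipeline fixed reproduces the kind of comparison reported in Table 9.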
Roy, A.K., Basu, T. Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering. Knowl Inf Syst 64, 723–742 (2022). https://doi.org/10.1007/s10115-022-01658-9