Abstract
Incorporating semantic information into a similarity metric increases its effectiveness and yields results that lend themselves to further human interpretation. Similarity calculated from the words of a text alone produces less accurate results. This study presents three alternative approaches, each of which builds feature vectors that combine semantic information from readers and then calculates similarities between those vectors. The three methods are Latent Semantic Analysis (LSA) using word2vec, Explicit Semantic Analysis using Bag-of-Words, and Soft Cosine Similarity using TF-IDF; all are based on textual data and knowledge-based methodologies. The technique produces easy-to-read documents that can be used in a range of information retrieval systems. When measuring similarity between short news texts, LSA with word2vec vectors outperformed the other two methods.
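As an illustration of why semantic information matters, the following minimal sketch contrasts plain cosine similarity with the soft cosine measure on two short texts. It is not the chapter's implementation: the vocabulary, the raw term counts (used here in place of TF-IDF weights), and the 0.8 similarity between "doctor" and "physician" are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bow_vector(text, vocab):
    """Raw term counts over a fixed vocabulary (a stand-in for TF-IDF weights)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]


def bilinear(a, sim, b):
    """Compute a^T * sim * b for a term-similarity matrix sim."""
    return sum(a[i] * sim[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def soft_cosine(a, b, sim):
    """Soft cosine measure; with an identity matrix it reduces to plain cosine."""
    denom = math.sqrt(bilinear(a, sim, a)) * math.sqrt(bilinear(b, sim, b))
    return bilinear(a, sim, b) / denom if denom else 0.0


# Hypothetical vocabulary and term similarities, for illustration only.
vocab = ["doctor", "physician", "visits", "patient"]
n = len(vocab)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
term_sim = [row[:] for row in identity]
term_sim[0][1] = term_sim[1][0] = 0.8  # assumed: "doctor" ~ "physician"

d1 = bow_vector("doctor visits patient", vocab)
d2 = bow_vector("physician visits patient", vocab)

plain = soft_cosine(d1, d2, identity)  # word overlap only
soft = soft_cosine(d1, d2, term_sim)   # credits related terms as well

print(f"plain cosine: {plain:.4f}")
print(f"soft cosine:  {soft:.4f}")
```

With the identity matrix the two texts are scored only on the words they share; once the related pair of terms is given a non-zero similarity, the score rises, which is the effect the semantic methods compared here aim for.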
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Srinivasarao, U., Karthikeyan, R., Bilal, M.J., Hariharan, S. (2023). Comparison of Different Similarity Methods for Text Categorization. In: Bhattacharya, A., Dutta, S., Dutta, P., Piuri, V. (eds) Innovations in Data Analytics. ICIDA 2022. Advances in Intelligent Systems and Computing, vol 1442. Springer, Singapore. https://doi.org/10.1007/978-981-99-0550-8_39
DOI: https://doi.org/10.1007/978-981-99-0550-8_39
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0549-2
Online ISBN: 978-981-99-0550-8
eBook Packages: Intelligent Technologies and Robotics (R0)