Comparison of Different Similarity Methods for Text Categorization

Ulligaddala Srinivasarao, R. Karthikeyan, Mohammad J Bilal & Shanmugasundaram Hariharan

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1442)

Included in the following conference series:

International Conference on Innovations in Data Analytics


Abstract

Incorporating semantic information into a similarity metric makes it more effective and yields results that lend themselves to human interpretation. Similarity computed only from the words of a text gives less accurate results. This study presents three approaches, each of which builds a feature vector that incorporates semantic information from readers and computes the similarity between such vectors. The three methods, which draw on textual data and knowledge-based methodologies, are Latent Semantic Analysis (LSA) using Word2Vec, Explicit Semantic Analysis using Bag-of-Words, and Soft Cosine Similarity using TF-IDF. The technique produces easy-to-read documents that can be used in a range of information retrieval systems. When measuring similarity between short news texts, LSA using Word2Vec vectors outperformed the other two methods.
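To make the contrast between word-only and semantically informed similarity concrete, the following is a minimal sketch of the soft cosine measure over TF-IDF vectors. It is not the chapter's implementation: the function names, the two toy documents, and the term-relatedness values (0.8 and 0.7) are assumptions chosen for illustration. In practice the relatedness values would come from word embeddings or a knowledge resource.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, so no term ends up with a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def similarity_matrix(vocab, related):
    """Identity matrix plus symmetric relatedness scores for chosen term pairs."""
    index = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (u, v), score in related.items():
        sim[index[u]][index[v]] = score
        sim[index[v]][index[u]] = score
    return sim


def soft_cosine(a, b, sim):
    """Soft cosine: a cosine in which sim[i][j] relates feature i to feature j."""
    def bilinear(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Two toy headlines with no words in common (hypothetical data).
docs = [["stocks", "rise"], ["shares", "climb"]]
vocab, (vec_a, vec_b) = tf_idf_vectors(docs)

# With the identity matrix, soft cosine reduces to the ordinary cosine.
identity = similarity_matrix(vocab, {})
plain = soft_cosine(vec_a, vec_b, identity)

# Assumed relatedness scores; illustrative values only.
related = {("stocks", "shares"): 0.8, ("rise", "climb"): 0.7}
soft = soft_cosine(vec_a, vec_b, similarity_matrix(vocab, related))

print(f"plain cosine: {plain:.2f}, soft cosine: {soft:.2f}")
```

Because the two headlines share no tokens, the ordinary cosine is zero, while the soft cosine is positive once related terms are allowed to contribute. This is the kind of gap between word-only and semantic similarity that the abstract describes.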



Author information

Authors and Affiliations

Department of Computer Science and Engineering (AI & ML), Vardhaman College of Engineering, Hyderabad, India
Ulligaddala Srinivasarao & R. Karthikeyan
Department of Information Technology, Vardhaman College of Engineering, Hyderabad, India
Mohammad J Bilal
Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, India
Shanmugasundaram Hariharan


Corresponding author

Correspondence to Ulligaddala Srinivasarao.

Editor information

Editors and Affiliations

Institute of Engineering & Management, Kolkata, India
Abhishek Bhattacharya
Institute of Engineering & Management, Kolkata, West Bengal, India
Soumi Dutta
Visva-Bharati University, Shantiniketan, West Bengal, India
Paramartha Dutta
Universita' Degli Studi di Milano, Milano, Italy
Vincenzo Piuri


About this paper

Cite this paper

Srinivasarao, U., Karthikeyan, R., Bilal, M.J., Hariharan, S. (2023). Comparison of Different Similarity Methods for Text Categorization. In: Bhattacharya, A., Dutta, S., Dutta, P., Piuri, V. (eds) Innovations in Data Analytics. ICIDA 2022. Advances in Intelligent Systems and Computing, vol 1442. Springer, Singapore. https://doi.org/10.1007/978-981-99-0550-8_39


DOI: https://doi.org/10.1007/978-981-99-0550-8_39
Published: 01 June 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0549-2
Online ISBN: 978-981-99-0550-8
eBook Packages: Intelligent Technologies and Robotics (R0)
