Abstract
Text comparison is the process of matching two or more texts to determine their similarities and differences. The similarity computed between two texts supports tasks such as classification, clustering, and retrieval. In this work, we improve existing text matching methods based on ElasticSearch and dynamic programming. ElasticSearch's indexing and search capabilities enable fast retrieval of the relevant documents to compare. During comparison, an improved LCS (Longest Common Subsequence) algorithm calculates the matches between the texts. We conduct extensive experiments on real-world datasets to evaluate the performance and effectiveness of our method. The results demonstrate that our approach accomplishes text comparison tasks more efficiently while handling various types of text noise.
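For readers unfamiliar with LCS, the following is a minimal sketch of the textbook dynamic-programming formulation that the method builds on. It is not the improved algorithm of this paper, whose details the abstract does not give. The function names `lcs_length` and `lcs_similarity` and the normalisation used for the similarity score are assumptions made for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook DP (not the paper's improved variant): O(len(a) * len(b)) time,
    O(min(len(a), len(b))) extra space by keeping only two rows.
    """
    # Keep the shorter string as the row dimension to save memory.
    if len(a) < len(b):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalised score in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

In a pipeline of the kind the abstract describes, a score like this would be computed only for the candidate documents returned by the search index, which keeps the quadratic-time comparison off the bulk of the corpus.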
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xiao, P., Lu, P., Luo, C., Zhu, Z., Liao, X. (2023). Fast Text Comparison Based on ElasticSearch and Dynamic Programming. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_5
DOI: https://doi.org/10.1007/978-981-99-7254-8_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7253-1
Online ISBN: 978-981-99-7254-8
eBook Packages: Computer Science (R0)