research-article

Enhanced clustering models with wiki-based k-nearest neighbors-based representation for web search result clustering

Published: 01 March 2022

Abstract

Information retrieval is difficult because of the overabundance of information on the web. Search engines now respond to user queries with too many results, of which only a few are relevant. Existing clustering methods fail to cluster snippets (short texts) of web documents because document terms occur with low frequency, and this failure warrants closer investigation. One way to address the problem is to expand document terms with semantically similar terms, which requires building a list of terms together with their closest and most accurate semantically similar words (a word representation). This study designs and develops a new framework to enhance the performance of web search result clustering (WSRC). It also presents a new unsupervised distributed word representation scheme in which each word is represented by a vector of its semantically related words; such a scheme expands both snippets and user queries. The proposed framework consists of several stages: (1) selection of standard datasets (Open Directory Project [ODP]-239 and MORESQUE) that are used to evaluate search result clustering algorithms in the most cited works, (2) text pre-processing, (3) document representation based on a new wiki-based k-nearest neighbors (KNN) representation method, (4) assessment of the effect of the proposed model on the performance of traditional clustering methods (k-means, k-medoids, single-linkage, and complete-linkage) for WSRC, and (5) evaluation of the proposed method. Results indicate that clustering methods enhanced with the new wiki-KNN-based representation show a significant improvement in WSRC over the baseline methods. Furthermore, the new data representation scheme enhances the overall performance of the clustering methods.
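The term-expansion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the wiki-based KNN representation is computed, so everything below is an assumption made for illustration: a four-sentence toy corpus stands in for Wikipedia, sentence-level co-occurrence counts stand in for the word vectors, and the names `CORPUS`, `STOPWORDS`, `cooccurrence_vectors`, `nearest_neighbors`, and `expand` are hypothetical. It is not the authors' implementation.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical stand-in for a Wikipedia-derived corpus (an assumption,
# not the data used in the paper).
CORPUS = [
    "jaguar is a big cat that hunts prey in the jungle",
    "the leopard is a big cat that hunts prey at night",
    "jaguar builds a luxury car with a powerful engine",
    "the sedan is a luxury car with a quiet engine",
]
STOPWORDS = {"is", "a", "the", "that", "in", "at", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def cooccurrence_vectors(corpus):
    """Map each word to a Counter of the words that share a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        words = tokenize(sentence)
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k):
    """Return up to k words whose vectors are most similar to the given word."""
    if word not in vectors:
        return []
    scored = [(cosine(vectors[word], vectors[o]), o) for o in vectors if o != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken alphabetically
    return [o for score, o in scored[:k] if score > 0]


def expand(snippet, vectors, k=2):
    """Add the k nearest neighbors of each snippet term to its term counts."""
    terms = Counter(tokenize(snippet))
    for w in list(terms):
        for neighbor in nearest_neighbors(w, vectors, k):
            terms[neighbor] += 1
    return terms


vectors = cooccurrence_vectors(CORPUS)
a, b = "leopard sighting", "cat prey"
raw = cosine(Counter(tokenize(a)), Counter(tokenize(b)))
expanded = cosine(expand(a, vectors), expand(b, vectors))
print(f"similarity before expansion: {raw:.3f}, after: {expanded:.3f}")
```

The two snippets share no terms, so their raw cosine similarity is zero and a clustering method working on term counts cannot relate them. After expansion the vectors overlap, which is the effect the abstract attributes to expanding low-frequency snippet terms. The expanded vectors could then be passed to any of the clustering methods the abstract lists.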


Cited By

  • (2023) A Low-Complexity Channel Estimation in Internet of Vehicles in Intelligent Transportation Systems for 5G Communication, Journal of Organizational and End User Computing 35:1 (1-21), doi:10.4018/JOEUC.326759. Online publication date: 28-Jul-2023


Published In

Journal of King Saud University - Computer and Information Sciences, Volume 34, Issue 3
Mar 2022
505 pages

Publisher

Elsevier Science Inc.
United States


Author Tags

  1. Clustering methods
  2. Web search result
  3. Word representation
  4. Query expansion
