Abstract
The integration of Big Data and AI offers significant opportunities to improve educational outcomes, particularly through Predicting Students' Performance (PSP) applications in higher education. Educational Data Mining (EDM) strategies have already been applied to educational challenges in advanced nations. However, the issues confronting developing countries, such as the unavailability of educational datasets and the difficulty of selecting Machine Learning (ML) algorithms that perform well in terms of accuracy, bias, and over-fitting, have not been investigated. To address this gap, this study presents UOBEDM, a novel dataset collected from the University of Baluchistan (UoB) in Balochistan, a developing region of Pakistan. The dataset comprises 49,835 student records covering demographic and academic attributes. After data cleaning and feature selection, it was refined to 23,492 instances. Several ML algorithms were fine-tuned on UOBEDM, with the top five algorithms (Trees, K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), and Support Vector Machines (SVM)) yielding accuracy scores of 0.95, 0.94, 0.92, 0.96, and 0.50, respectively. A novel approach, the Cross-Classification Matrix (CCM), was introduced to assess algorithm performance and select the best model. Trees emerged as the optimal predictive algorithm and were used to develop a graphical tree-based Early Intervention Model (EIM) that simplifies decision-making for academics. Beyond classification, the dataset opens the way for further EDM research and for work on educational inequalities. The findings contribute to the understanding of predictive modeling in education and offer insights for educators, policymakers, and researchers.
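To make the comparison step concrete, the following minimal sketch shows how a confusion matrix, an accuracy score, and a simple over-fitting indicator can be computed, and how the accuracies reported above rank the five algorithms. It is not the authors' CCM, whose definition is not given in this abstract. The helper names (`confusion_matrix`, `accuracy`, `overfit_gap`) and the toy labels are assumptions made purely for illustration; only the `reported` accuracies come from the abstract.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def overfit_gap(train_acc, test_acc):
    """Train-minus-test accuracy; a large positive gap hints at over-fitting."""
    return train_acc - test_acc


# Toy labels invented for illustration only (NOT taken from UOBEDM).
y_true = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
y_pred = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)

# Accuracy scores as reported in the abstract.
reported = {"Trees": 0.95, "KNN": 0.94, "NB": 0.92, "RF": 0.96, "SVM": 0.50}

# Rank the algorithms by reported accuracy, highest first.
ranked = sorted(reported, key=reported.get, reverse=True)

if __name__ == "__main__":
    print("toy accuracy:", acc)
    print("ranking by reported accuracy:", ranked)
```

Ranking by accuracy alone places RF first, whereas the abstract reports Trees as the selected model. This suggests that the CCM weighs criteria beyond raw accuracy, possibly the bias and over-fitting concerns mentioned above, although the abstract does not spell them out.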
Data availability
The UOBEDM data are available on GitHub: https://github.com/ImamDad/UOBEDM.
Acknowledgements
We thank the Ministry of Science and Technology of the Republic of China, the China Scholarship Council, and Kunming University of Science and Technology for their support.
Funding
This study is supported in part by the Ministry of Science and Technology of the Republic of China under contract numbers MOST-109-2511-H-011-002-MY3 and MOST-108-2511-H-011-005-MY3.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical Approval
Participants were protected by concealing their personal information. Participation was voluntary, and participants knew that they could withdraw from the study at any time. The data can be provided upon request by e-mailing the corresponding author.
About this article
Cite this article
Dad, I., He, J., Noor, W. et al. Cross Classification Matrix to Evaluate the Performance of Machine Learning Algorithms in Predicting Students Performance of Developing Regions. SN COMPUT. SCI. 5, 621 (2024). https://doi.org/10.1007/s42979-024-02909-y
DOI: https://doi.org/10.1007/s42979-024-02909-y