Data Mining Approach To Predict Academic Performance of Students
METHODS
*Correspondence:
Partha Ghosh,
parth_ghos@rediffmail.com
Powerful data mining techniques are now applied in a variety of educational fields. Educational research is
advancing rapidly due to the vast amount of student data that can be used to create insightful patterns
related to student learning. Educational data mining (EDM) is a tool that helps universities assess and identify student
performance, and well-known classification techniques have been widely used to determine student success.
A decisive and growing exploration area in EDM is predicting student academic performance, using data
mining and machine learning approaches to extract knowledge from education repositories. According to
relevant research, there are several academic performance prediction methods aimed at supporting
administrative and teaching staff in academic institutions. In the proposed approach, the collected data set is
preprocessed to ensure data quality, and the labeled student education data are used to train and evaluate
k-nearest neighbor (KNN), support vector machine (SVM), random forest, and decision tree (DT) classifiers.
The performance of the four classifiers is measured by accuracy, the receiver operating characteristic (ROC)
curve, the F1 score, and the confusion matrix of each model. Finally, we found that the top three models
had an accuracy of 86–95%, an F1 score of 85–95%, and an average one-vs-all (OVA) area under the ROC
curve of 98–99.6%.
Keywords: predictive analysis, KNN, SVM, random forest, DT classifier, students’ academic performance
– We want to analyze our model’s performance using various performance metrics.
– In-depth analysis considers the impact of all classifiers to determine the best classifier to predict student performance.

3.2. Data set

• The information utilized in this study was gathered from two community schools in Portugal’s Alentejo area in the academic year 2005–2006.
• The database was built from the following sources:

– Paper-based school reports (absences and the three period grades)
– Questionnaires, which are used to supplement the earlier data

• Characteristic information:

Both the student-mat.csv (for a math course) and student-por.csv (for a Portuguese language course) databases include the same attributes:

1. School – student’s school (binary: “GP” for Gabriel Pereira or “MS” for Mousinho da Silveira)
2. Sex – student’s gender (binary: “F” – female or “M” – male)
3. Age – student’s age (numeric: 15–22)
4. Address – type of home address (binary: urban or rural)
5. Famsize – family size (binary: “GT3” – greater than 3 or “LE3” – 3 or fewer)
6. Pstatus – parents’ cohabitation status (binary: “T” – living together or “A” – apart)
7. Medu – mother’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education, or 4 – higher education)
8. Fedu – father’s education (numeric: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education, or 4 – higher education)
9. Mjob – mother’s job (nominal: “education,” “health care,” “civil services” (police, government, etc.), “at home,” or “other”)
10. Fjob – father’s job (nominal: “education,” “health care,” “civil services” (government, police, etc.), “at home,” or “other”)
11. Reason – reason for choosing this school (nominal: close to “home,” school reputation, desired course, or “other”)
12. Guardian – the person responsible for the student (nominal: “mother,” “father,” or “other”)
13. Traveltime – travel time from home to school (numeric: 1 – <15 min, 2 – 15–30 min, 3 – 30 min to 1 h, or 4 – more than 1 h)
14. Studytime – total weekly study time (numeric: 1 – <2 h, 2 – 2–5 h, 3 – 5–10 h, or 4 – >10 h)
15. Failures – total number of failures in the previous class (numeric: n if 1 ≤ n < 3, otherwise 4)
16. Schoolsup – extra educational support (binary: yes or no)
17. Famsup – family educational support (binary: yes or no)
18. Paid – extra paid classes within the course subject (mathematics or Portuguese) (binary: yes or no)
19. Activities – after-school activities (binary: yes or no)
20. Nursery – attended a preschool (binary: yes or no)
21. Higher – intends to pursue higher education (binary: yes or no)
22. Internet – access to the Internet at home (binary: yes or no)
23. Romantic – engaged in a romantic relationship (binary: yes or no)
24. Famrel – quality of family relationships (numeric: 1 – very bad to 5 – excellent)
25. Freetime – free time after school (numeric: 1 – very low to 5 – very high)
26. Goout – going out with friends (numeric: 1 – very low to 5 – very high)
27. Dalc – workday alcohol consumption (numeric: 1 – very low to 5 – very high)
28. Walc – weekend alcohol consumption (numeric: 1 – very low to 5 – very high)
29. Health – current state of health (numeric: 1 – very bad to 5 – very good)
30. Absences – total number of days absent from school (numeric: 0–93); these counts refer to the mathematics or Portuguese classes
31. G1 – first-period grade (numeric: 0–20)
32. G2 – second-period grade (numeric: 0–20)
33. G3 – final grade (numeric: 0–20, output target)

3.3. Methodology

The proposed model (Figure 1) is structured for the analysis and evaluation of the student data set. In our model, we first import the specified data set. Then, we use different data visualization techniques:

– Histogram (to check the count of students receiving each final grade at each age)
– Count plot (to compare the counts of students with different attributes)
– Box plot (to check whether any outliers are present in the data)

Based on the nature of the data, we perform data preprocessing by removing the outliers. Next, we divide the data set into test and training sets. Then we train 4 different classification algorithms individually on the training set: decision tree, k-nearest neighbor (KNN), support vector machine (SVM), and random forest. Each trained model is evaluated using the following metrics:
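The preprocessing and splitting steps just described can be sketched in Python. This is only a sketch on a synthetic miniature of the data: the real files would be read with `pd.read_csv("student-mat.csv", sep=";")`, and the IQR-based outlier rule is an assumption, since the paper does not name its exact outlier criterion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature stand-in for student-mat.csv / student-por.csv
# (in practice: df = pd.read_csv("student-mat.csv", sep=";"))
df = pd.DataFrame({
    "age":      [15, 16, 17, 18, 16, 22, 15, 17, 18, 16],
    "absences": [2, 0, 4, 93, 1, 3, 0, 2, 5, 1],
    "G3":       [10, 12, 14, 3, 11, 8, 15, 13, 9, 12],
})

# Outlier removal via the interquartile-range rule (an assumed choice;
# the paper only says outliers were removed after inspecting box plots)
q1, q3 = df["absences"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["absences"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]

# Split features and target, then divide into training and test sets
X, y = df.drop(columns="G3"), df["G3"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```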
42 Ghosh et al.
– Accuracy measure
– ROC score
– F1 score
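These three metrics, together with the confusion matrix, can be computed with scikit-learn; the grade labels and probability scores below are hypothetical stand-ins for a model's test-set output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Hypothetical true and predicted grades for a 4-grade problem
y_true = np.array([1, 2, 2, 3, 3, 4, 4, 4])
y_pred = np.array([1, 2, 3, 3, 3, 4, 4, 3])

# One-vs-rest probability scores; in practice these would come from
# clf.predict_proba(X_test) (each row sums to 1)
y_prob = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.0, 0.1, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.0, 0.1, 0.2, 0.7],
    [0.1, 0.1, 0.5, 0.3],
])

print(accuracy_score(y_true, y_pred))                    # accuracy measure
print(f1_score(y_true, y_pred, average="weighted"))      # F1 score
print(roc_auc_score(y_true, y_prob, multi_class="ovr"))  # ROC score (OVR)
print(confusion_matrix(y_true, y_pred))                  # rows: true class
```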
FIGURE 6 | Boxplot showing final grade (G3) vs. age.
Finally, we compare the analysis based on accuracy and
obtain the final results.
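The comparison step can be sketched as a loop over the four classifiers. Synthetic data and default hyperparameters are assumptions here; the paper does not report its exact settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student data set (4 grade classes)
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The four classifiers compared in this work
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Random forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its test accuracy
results = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
           for name, m in models.items()}
print(results)
```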
3.4. Data preprocessing

3.4.1. An overview of the data set

Figure 2 represents the final grade (G3) of the data set vs. the count of students in each age group for whom the data were collected. As can be seen in the figure, there is a large group of students whose age appears as 0: these students have no age record, so their age was set to 0 to eliminate null values.

Figure 3 shows whether or not any attribute of the data set contains null values. The uniform cyan color confirms that none of the attributes has any null values.
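The null-value check behind Figure 3 can be sketched with pandas; the column names here are a tiny assumed subset of the data set.

```python
import pandas as pd

# Toy frame standing in for the student data set (assumed columns)
df = pd.DataFrame({"age": [15, 16, 17], "absences": [2, 0, 4], "G3": [10, 12, 14]})

# Per-attribute null counts; all zeros mirrors the uniform heatmap of Figure 3
null_counts = df.isnull().sum()
print(null_counts)

# The Figure 3 visualization itself would be drawn with seaborn, e.g.:
# import seaborn as sns; sns.heatmap(df.isnull(), cbar=False)
```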
10.54646/bijscit.2023.35
3.5. Algorithms
FIGURE 7 | Boxplot showing final grade (G3) vs. higher education.

In this work, tests were performed using SVM, KNN, decision tree, and random forest classifiers, and their metrics were evaluated on the data set. SVM is used to solve pattern recognition and nonlinear function estimation problems: the training data are mapped nonlinearly into a high-dimensional feature space, where a separating hyperplane with the widest possible margins can be built, resulting in a nonlinear decision boundary in the input space. The support vector machine solution is obtained from a quadratic programming problem with a global solution.
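This kernel-based construction can be sketched with scikit-learn's SVC on a toy nonlinear problem; the paper's actual kernel and parameters are not specified, so those used here are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Nonlinearly separable toy data; the RBF kernel implicitly maps it into a
# high-dimensional feature space where a separating hyperplane exists
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Fitting solves a quadratic programming problem with a global optimum
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```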
FIGURE 11 | Count plot for comparison of probability of outcome as grade X vs. All for decision tree.
FIGURE 13 | Count plot for comparison of probability of outcome as grade X vs. all for KNN.
Random forests also work well with data of high dimensions; the ensemble avoids overfitting problems and improves model accuracy. Both regression and classification problems may be accomplished using random forests, although they are less well suited for applications that require regression.

3.6. Performance metrics used

The classifiers are compared using the accuracy measure, the ROC score, and the F1 score, together with the confusion matrix of each model.

4. Experimental results and discussion

We have applied all the above-mentioned classifiers to the data set one by one and have calculated the performance metrics.

i) Decision Tree

Confusion matrix for test and train data set
FIGURE 15 | Count plot for comparison of probability of outcome as grade X vs. all for SVC.
Three of the six grade 1 kids in the test group were accurately categorized as being in grade 1, while the other three were categorized as being in grade 2.

All 19 of the grade 2 kids were appropriately categorized as being in grade 2.

Of the 30 third-grade pupils, 29 were accurately identified as such, while one was misclassified.
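Readings like these come directly from the rows of a confusion matrix. A small sketch, with hypothetical labels chosen to reproduce the grade 1 and grade 2 counts described above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test labels reproducing the counts above: six grade 1 pupils
# (3 correct, 3 predicted as grade 2) and nineteen grade 2 pupils (all correct)
y_true = np.array([1] * 6 + [2] * 19)
y_pred = np.array([1, 1, 1, 2, 2, 2] + [2] * 19)

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Row i is the true class, column j the predicted class:
#   cm[0] -> [3, 3]: 3 grade 1 pupils correct, 3 sent to grade 2
#   cm[1] -> [0, 19]: all 19 grade 2 pupils correct
```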
FIGURE 17 | Count plot for comparison of probability of outcome as grade X vs. all for random forest.
4.3. Conclusion from the train data's confusion matrix

Four of the seven grade 1 kids in the test group were accurately categorized as being in grade 1, while three were categorized as being in grade 2.

Of the 61 grade 2 pupils, 54 were accurately categorized as being in that grade, while 7 were misclassified.

Of the 117 third-grade pupils, 107 were accurately identified as such, while 2 were misclassified as second graders and another 8 as fourth graders.

Of the 115 grade 4 pupils, 106 were categorized properly as being in grade 4 and 9 were misclassified as being in grade 3.

TABLE 1 | Comparison between performances of different classifiers.

Classifier          F1 score  Accuracy  Area under ROC curve
Decision tree       0.90      0.90      0.91
K-nearest neighbor  0.85      0.86      0.98
SVM                 0.95      0.95      0.996
Random forest       0.80      0.83      0.98

4.4. Conclusion from the test data's confusion matrix

One of the six grade 1 pupils in the test group was accurately categorized as in grade 1, while the other five were categorized as in grade 2.
Of the 19 kids in grade 2, 18 were appropriately identified as such, while 1 was mistakenly placed in grade 3.

Of the 30 third-grade pupils, 28 were accurately identified as such, while 1 was misclassified as a second grader and the remaining 3 as fourth graders.

Of the 38 fourth-grade pupils, 35 were accurately identified as such, while 3 were misidentified. The count plot comparing the probability of outcome as grade X vs. all for KNN is shown in Figure 13, and the ROC OVR of the 4 grades using KNN is shown in Figure 14.

iii) SVM

Confusion matrix for test and train data set

4.5. Conclusion from the train data's confusion matrix

Of the 117 third-grade pupils, 111 were accurately identified as such, while 2 were misclassified as second graders and another 4 as fourth graders.

Of the 115 grade 4 pupils, 113 were accurately categorized as being in grade 4, while 2 were mistakenly categorized as being in grade 3.

4.6. Conclusion from the test data's confusion matrix

Of the six grade 1 pupils tested, four were accurately categorized as in grade 1, while two were categorized as in grade 2.

All 19 of the grade 2 children were correctly placed in that grade.

Of the 32 pupils in grade 3, 29 were accurately identified as such, while 3 were misclassified.

All 38 grade 4 students were correctly classified as grade 4. The count plot comparing the probability of outcome as grade X vs. all for SVC is shown in Figure 15, and the ROC OVR of the 4 grades using SVC is shown in Figure 16.

iv) Random Forest
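A random forest classifier for this step can be sketched as follows; the synthetic data and default hyperparameters are assumptions, since the paper does not report its settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student features and 4 grade classes
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# An ensemble of decision trees; each tree votes and the majority wins
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

cm = confusion_matrix(y_test, rf.predict(X_test))
print(cm)  # rows: true grade, columns: predicted grade
```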