Academic Prediction

Volume 6, Issue 6, June – 2021 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
Angelo Jaison, Ashik V, Aswin E B, Daniyel Johnson
Students, Computer Science and Engineering Department
Sahrdaya College of Engineering Technology
Thrissur, India

Priya K V
Asst. Professor, Computer Science and Engineering Department
Sahrdaya College of Engineering Technology
Thrissur, India

Abstract:- Every educational institute maintains a proper database of its students' performance and activities. This information is incredibly useful in the realm of education, particularly for evaluating the performance of students. Evaluating student performance has grown difficult because the data sets in this discipline are imbalanced and because different resampling methods have not been compared. Several classification models, namely Random Forest, Artificial Neural Network and Logistic Regression, are compared in this paper, and the models are validated using 5-fold cross-validation. These methods provide an accurate output on the current performance of students and state the variance in their performance. This provides a reliable source to view and check the performance of students.

I. INTRODUCTION

Educational quality is essential to a country's development. With the help of admissions systems, academic information systems, learning management systems, and e-learning, data in the education domain is growing by the day. The information gathered from students is typically used to answer simple questions and make decisions. However, due to the complexity and enormous size of the data sets, the majority of this data stays unused. As a result, analysing this vast amount of educational data in order to predict student achievement is a hot topic. Data mining, also known as knowledge discovery in databases (KDD), is the process of extracting meaningful information from large collections of data. It has been used successfully in a variety of industries, including finance, medicine, and business, and it is now being employed in education under the label Educational Data Mining (EDM).

Predicting student performance is a crucial topic that EDM is looking at. This task predicts the value of an unknown variable that characterises students in terms of outcome (Pass/Fail), grades, and marks, among other things. The literature review for this study focuses on predicting student attrition, failures, and success. The stakeholders in this area are all looking for an early warning system that can predict learning outcomes at an early stage. Such an early warning system reduces not only the cost of learning but also the amount of time and space needed.

Because of technological advancements, predicting student success has become an essential study issue. In this field, unbalanced datasets have made it difficult to forecast student performance, and there is no comparison between various resampling strategies. The majority of data on student performance and activities is not used. These data can be used to make accurate forecasts and evaluations of students if a good baseline is created. To deal with the imbalanced data, this paper compares a few classifiers, namely Random Forest, Artificial Neural Network, and Logistic Regression. To fully verify the efficiency of the resampling strategies, balanced datasets are used to improve classifier performance. In addition, the 5-fold cross-validation approach is employed to validate the model.

Based on the prediction results, students may get a better idea of how well or poorly they will perform in a course, and can then take steps to improve their performance. Any educational institution around the world has a long-term goal of increasing student retention. Increased retention has numerous benefits, including improved college reputation, ranking, and job possibilities for alumni, among others.
IJISRT21JUN296 www.ijisrt.com 368

II. RELATED WORK

Here we look at some of the related works in the field of student performance prediction.

1) KWOK TAI CHUIL, RYAN WEN LIU, MINGBO ZHAO and PATRICIA ORDONEZ DE PABLOS, "Using a Deep Support Vector Machine based on a Generative Adversarial Network to Predict Student Performance with School and Family Tutoring": Automating student evaluation entails the development and application of machine learning technologies to aid in student analysis. The paper proposes an improved conditional GAN (ICGAN) based deep support vector machine (DSVM) algorithm. ICGAN addresses the problem of low data volume by generating additional training data on student performance, whereas DSVM extends SVM from shallow to deep learning and serves as the performance prediction model for students. As a result, data on students and their performance can be collected more consistently and accurately.

2) HANAN ABDULLAH MENGASH, "Using Data Mining Techniques to Predict Student Performance to Assist University Admissions Decision Making": This research aimed to assist universities in making admission decisions by using data mining techniques to forecast students' academic achievement before they were admitted. Using data mining classification approaches, the author designed and tested four prediction models to predict early academic success among candidates based on their preadmission profiles. The four well-known data mining approaches used were Artificial Neural Network (ANN), Decision Tree, Support Vector Machine (SVM), and Naive Bayes. The university where this study was done chose to adjust the weighting of its admission standards based on the findings and recommendations of this study.

3) RAMIN GHORBANI AND ROUZBEH GHOUSI, "Comparing Different Resampling Methods in Predicting Students' Performance Using Machine Learning Techniques": The educational system recognises the potential of data mining to significantly improve its performance. Although data mining aids in knowledge discovery, it is not without its drawbacks, and machine learning algorithms give the required tools for this purpose. To solve the unbalanced data problem when forecasting students' performance, this paper compares different resampling techniques, namely Borderline SMOTE, Random Over Sampler, SMOTE, SVM-SMOTE, SMOTE-ENN, and SMOTE-Tomek, using two different datasets. Model validation is done using 5-fold cross-validation, which divides the dataset into five subsets and uses one of the five subsets as the test set and the other four subsets as the training set.

4) ABDULLAH ALSHANQITI AND ABDALLAH NAMOUN, "Using Hybrid Regression and Multi-Label Classification to Predict Student Performance and Its Influential Factors": Students have a wide range of characteristics and past behaviours, and using a single model may result in erroneous predictions. This paper contributes a hybrid regression model that improves the accuracy of predicting student academic performance, as measured by future grades in various courses, as well as an optimised multi-label classifier that predicts qualitative values for the effect of multiple factors related to the student results received.

III. PROPOSED SYSTEM

As we have seen, there are quite a lot of different approaches to predicting student performance. What we hope to achieve is a performance prediction system that has the features expected of modern software and combines them with machine learning to further enhance its capabilities.

Student performance is the most essential aspect in determining the quality of a university in higher education. Because of its importance in decision making, EDM is currently the method most extensively used by researchers for evaluating and predicting student performance. There are two key things to consider when predicting student performance: characteristics and prediction methods. Student CGPA has been shown to be the most commonly used indicator in predicting university performance, and it has been used in numerous studies. Assessments, quiz grades, lab work, and final exam marks are some of the other characteristics that researchers use to predict student performance at university. Other factors such as extracurricular activities, student demographics, and social contact networks have been used by a few studies.

Several data mining classification algorithms have been used to predict student performance in the literature. In one study, the academic progress of 505 eighth-semester students was predicted using ANN. Another study used Decision Trees to predict student achievement in specific courses from small samples (32 and 42 students, respectively). In a study of 1,600 students, Naive Bayes was used to predict achievement in a specific subject. SVM was used in another study to predict the first-year performance of at-risk students on a data set of 1,074 students.

According to a review of the literature, the majority of studies do not look into employing several prediction models and using the most accurate results. This research aims to analyse the various resampling approaches for dealing with the unbalanced data problem in order to determine the optimal technique and classifier for forecasting student performance. This study also aims to look into the differences between multiclass and binary classification, as well as the significance of feature structure. Models with fewer classes and nominal features perform better, according to the findings obtained using multiple assessment measures. Furthermore, classifiers do not perform well when data is imbalanced, thus this problem must be addressed. A balanced dataset improves the performance of classifiers.
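To make the notion of a balanced dataset concrete, the following is a minimal sketch of random oversampling, the simplest of the resampling families mentioned in the related work. It is an illustration only and not the procedure used in this study: the function name `random_oversample`, the toy records and the fixed seed are assumptions introduced for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (target - n) extra rows, with replacement, from this class.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: (study_time, absences) with a Pass/Fail label.
rows = [(2, 4), (3, 0), (1, 10), (4, 2), (2, 6), (1, 16)]
labels = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind is normally applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.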

IV. METHODOLOGY

This research aims to analyse various machine learning approaches for dealing with imbalanced data in order to determine the optimal methodology and classifier for predicting student performance. Figure 1 depicts the steps of the applied approach used to attain the paper's objectives.

Fig. 1: Architecture diagram

A. Data Collection
The student performance data set used in this study has been collected from Kaggle. It contains 1044 instances, each with 33 properties, such as student grades and demographic, social and school-related characteristics. The attributes present in our dataset are Sex, Age, School, Address, Parent status, Mother's education, Mother's job, Father's education, Father's job, Student's guardian, Family size, Quality of family relationship, Reason to choose school, Travel time, Study time, Failures, School support, Family support, Activities, Tuition details, Internet, Nursery, Higher education, Relationship status, Free time, Going out with friends, Weekend alcohol consumption, Workday alcohol consumption, Health status, Absences, First period grade, Second period grade, and Final grade. The dataset is downloaded from an open source Kaggle kernel database. The dataset attributes are independent of each other except the student's final grade, which depends on all other attributes.

B. Data Preprocessing
Data preparation is one of the most important phases in machine learning. This process converts the raw data into a format that can be understood. In the real world, datasets contain several flaws; this phase can eliminate the errors, making datasets easier to manage. Because the datasets used in this study include no missing data, treating missing data as a stage of data preparation is not necessary.

C. Feature Extraction
Feature extraction, used here in the sense of feature scaling or data normalisation, is a technique for normalising the range of a dataset's independent variables or features. Many machine learning models rely on the Euclidean distance between two data points, hence they may not operate well without this step. Standardization, Mean Normalization, Min-Max Scaling, and Unit Vector are four prominent techniques for implementing it. The student performance dataset values used in this study span a wide range. To rescale the features, this research used the Standardization approach. As a result, all of the features have the standard normal distribution properties of μ = 0 and σ = 1, where μ is the average and σ is the standard deviation from the average. Equation (1) defines the formula for scaling a value x to its standardized value z.

$$z = \frac{x - \mu}{\sigma} \tag{1}$$

D. Random Forest
Random Forest is a common machine learning algorithm that belongs to the supervised learning family. It can be used for both classification and regression problems. It is based on the concept of ensemble learning, which is a process of combining different classifiers to resolve a complex problem and to improve the performance of the model. Random Forest combines numerous decision trees trained on distinct subsets of the dataset and aggregates them to improve predictive accuracy. Instead of relying on a single decision tree, random forest gathers the predictions from all trees and predicts the final result based on the majority of votes.

E. Logistic Regression
In its most basic form, logistic regression is a statistical model that represents a binary dependent variable with a logistic function, although many more complex extensions are available. In regression analysis, logistic regression estimates the parameters of a logistic model. Logistic regression, like all regression analyses, is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Logistic regression output can be difficult to interpret; a tool such as Intellectus Statistics can be used to conduct the analysis and report the results in plain English.

F. Artificial Neural Network
Artificial neural networks are comparable to biological neural networks in form, function, and data processing, and they are a relatively good methodology for solving classification and prediction problems. An ANN is a set of mathematical models that can imitate a number of properties of biological neural systems and that resembles adaptive human learning.

ANNs are made up of a large number of linked neurons joined by connections that carry weight coefficients, which are similar to synapses in function. The neurons are organised in three levels: an input layer, one or more hidden layers, and an output layer.

ANNs handle data in the same way that biological neural networks do, with the added capability of remembering, learning, and correcting errors at a high rate, which allows neural networks to be used to solve complicated tasks like classification and prediction. ANNs have been effectively employed to model complicated real-world situations in a variety of areas.
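As a rough illustration of three building blocks described in subsections C, D and E, the sketch below implements standardization, the majority vote a random forest applies to its trees' class predictions, and the logistic function. It uses only the Python standard library; the function names and the sample `absences` column are assumptions made for the example and are not taken from the project's code.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def standardize(values):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def majority_vote(tree_predictions):
    """Combine per-tree class predictions the way a random forest does
    for classification: the most frequent label wins."""
    return Counter(tree_predictions).most_common(1)[0][0]


def logistic(z):
    """Logistic (sigmoid) function that maps a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


absences = [0, 4, 10, 2, 6, 16]  # hypothetical feature column
scaled = standardize(absences)
print([round(v, 2) for v in scaled])

print(majority_vote(["Pass", "Fail", "Pass"]))  # votes from three trees
print(logistic(0.0))  # a score of 0 maps to probability 0.5
```

In practice the mean and standard deviation would be computed on the training data and reused unchanged when scaling a new student's details at prediction time.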

G. Model Validation
Cross-validation is a model validation approach for determining how well the results of a statistical analysis generalise within a single dataset. This research uses the results of shuffled 5-fold cross-validation for each resampling procedure. This divides the dataset into five subsets, with one of the subsets serving as the test set and the other four serving as the training set. A different subset is held out each time, and the process is repeated five times.

The results of shuffled 5-fold cross-validation are more reliable and appropriate because of the way this methodology works. After addressing the unbalanced data problem, the results show that several of the model accuracies have improved slightly. Comparing the output of the different classifiers and resampling techniques, Random Forest performed well on almost all balanced datasets.

H. Flowchart
Figure 2 represents the flowchart of our project. The flowchart shows the two phases of our project: the training phase and the prediction phase. In the training phase the data is collected, preprocessed and feature selected to obtain a clean dataset. The normalised data is then used to train three different algorithms. Next is the prediction phase. During this phase the relevant details of a student are collected as input for the prediction; the collected details are also feature extracted to make them compatible with the ML algorithms. The output is a text file that contains the prediction result of the student as a percentage and a description of whether he/she has passed or not.

Fig. 2: Flowchart

I. Use Case Diagram
Figure 3 shows the use case diagram of our project. It shows the possible interactions of the different users with our system. There are two kinds of users, user and administrator. Both have equal interaction with the system, but only the administrator has administrative power over it. The user group consists of students and teachers. A student can input their details and make predictions. The teacher can inspect the individual results of students.

Fig. 3: Use case diagram
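The shuffled 5-fold procedure described in subsection G (Model Validation) can be sketched in plain Python as follows. This is a minimal illustration rather than the project's implementation; the helper name `shuffled_k_fold` and the fixed seed are assumptions, and only the sample count of 1044 comes from the dataset description.

```python
import random


def shuffled_k_fold(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold
    cross-validation. Every sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# The dataset in this study has 1044 instances.
folds = list(shuffled_k_fold(1044, k=5))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train)} training rows, {len(test)} test rows")
```

The score reported for a classifier would then be the average of its scores over the five test subsets. Libraries such as scikit-learn offer an equivalent split through `KFold(n_splits=5, shuffle=True)`.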
V. RESULT

Data Mining is extremely helpful, especially for analyzing students' performance. Our project helps us to learn in advance whether a student may pass or fail in the upcoming series assessment. Our system provides the result in the form of text which is generated after the prediction phase; it includes the predicted percentage for the student's next series assessment. It also has a textual output describing whether he/she will pass or fail, in the form (Pass/Fail). This helps students to foresee their results and be prepared for their upcoming series assessment.

It also has a feature that recommends to the student what they need to do to improve their results in future. This recommendation module works by comparing the student's feature-extracted values with the closest students' values. From this comparison the system gets an overview of what needs to be improved to increase the student's results. This is a useful feature for students who get "fail" as the prediction output.

VI. CONCLUSION

Predicting student performance is one of the most important study subjects that should be investigated right now. Data Mining is tremendously useful in the realm of education, particularly for analysing student performance. Because of the imbalance in data sets in this field, forecasting students' performance has become a huge challenge, and there is no comparison between different machine learning algorithms. The project analyses the given datasets, applies various ML algorithms, compares their outputs and gives the most accurate results. It is worth noting that two separate datasets linked to student performance are employed, and that the differences between multiclass and binary classification and the feature structure are examined. To strengthen the conclusions about resampling approaches, several classifiers can be used. On the imbalanced dataset, all of the classifiers are first run using the random hold-out approach. The results reveal that when dealing with unbalanced data, classifiers are unable to make correct predictions and are unable to predict some of the classes at all. Furthermore, the findings collected using various assessment criteria show that having fewer classes leads to better performance with machine learning models.

This research can be expanded in a variety of ways, and future work could go along the following lines. For a better comparison and improved performance, new ensemble and hybrid classifiers could be developed. Additionally, feature selection approaches can be used to improve model results and gain a better understanding of the important features.

REFERENCES

[1]. KWOK TAI CHUIL, RYAN WEN LIU, MINGBO ZHAO and PATRICIA ORDONEZ DE PABLOS, Predicting Students' Performance with School and Family Tutoring using Generative Adversarial Network based Deep Support Vector Machine. Digital Object Identifier 10.1109/ACCESS.2019.Doi Number. Shanghai 200051, China
[2]. ABDULLAH ALSHANQITI AND ABDALLAH NAMOUN, Predicting Student Performance and Its Influential Factors Using Hybrid Regression and Multi-Label Classification, November 19, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3036572. Madinah 42351, Saudi Arabia
[3]. RAMIN GHORBANI AND ROUZBEH GHOUSI, Comparing Different Resampling Methods in Predicting Students' Performance Using Machine Learning Techniques, April 22, 2020. Digital Object Identifier 10.1109/ACCESS.2020.2986809. Tehran 16846-13114, Iran
[4]. HANAN ABDULLAH MENGASH, Using Data Mining Techniques to Predict Student Performance to Support Decision Making in University Admission Systems, March 30, 2020. Digital Object Identifier 10.1109/ACCESS.2020.2981905. Riyadh 11351, Saudi Arabia
[5]. AGORITSA POLYZOU AND GEORGE KARYPIS, Feature Extraction for Next-Term Prediction of Poor Student Performance, IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES, VOL. 12, NO. 2, APRIL-JUNE 2019
[6]. SANA BHUTTO, DR. QASIM ALI ARAIN, MALEEHA ANWAR, Predicting Students' Academic Performance Through Supervised Machine Learning, 2020 International Conference on Information Science and Communication Technology
[7]. Y.-H. HU, C.-L. LO, AND S.-P. SHIH, "Developing early warning systems to predict students' online learning performance," Comput. Hum. Behav., vol. 36, pp. 469–478, Jul. 2014.
[8]. M. ZIĘBA, S. K. TOMCZAK, AND J. M. TOMCZAK, "Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction," Expert Syst. Appl., vol. 58, pp. 93–101, Oct. 2016.
