Academic Prediction
ISSN No:-2456-2165
Angelo Jaison
Student, Computer Science and Engineering Department
Sahrdaya College of Engineering Technology
Thrissur, India

Ashik V
Student, Computer Science and Engineering Department
Sahrdaya College of Engineering Technology
Thrissur, India

Priya K V
Asst. Professor, Computer Science and Engineering Department
Sahrdaya College of Engineering Technology
Thrissur, India
Abstract:- Every educational institute maintains a database of its students' performance and activities. This information is very useful in education, particularly for evaluating student performance. Evaluating student performance has nevertheless become difficult, because the data sets in this discipline are imbalanced and the different resampling methods have not been compared. This paper compares three models, Random Forest, Artificial Neural Network, and Logistic Regression, and validates them with 5-fold cross-validation. These models provide an accurate output on the current performance of students and state the variance in their performance. This provides a reliable source to view and check the performance of students.

I. INTRODUCTION

Educational quality is essential to a country's development. With the help of admissions systems, academic information systems, learning management systems, and e-learning, data in the education domain is growing by the day. The information gathered from students is typically used to answer simple questions and make decisions. However, because of the complexity and enormous size of the data sets, the majority of this data stays unused. As a result, analysing this vast amount of educational data in order to predict student achievement is an active research topic. Data mining, also known as knowledge discovery in databases (KDD), is the process of extracting meaningful information from large collections of data. It has been used successfully in a variety of fields, including finance, medicine, and business, and it is now being employed in education under the label Educational Data Mining (EDM).

Predicting student performance is a crucial topic in EDM. This task predicts the value of an unknown variable that characterises students, such as outcome (Pass/Fail), grades, or marks. The literature review for this study focuses on predicting student attrition, failure, and success. The stakeholders in this area are all looking for an early warning system that predicts learning outcomes at an early stage. Such an early warning system reduces not only the cost of learning but also the time and space needed.

Because of technological advancements, predicting student success has become an essential research issue. In this field, imbalanced datasets have made it difficult to forecast student performance, and there is no comparison between the various resampling strategies. The majority of data on student performance and activities is not used. These data can be used to make accurate forecasts and evaluations of students if a good baseline is created. To deal with the imbalanced data, this paper compares three models: Random Forest, Artificial Neural Network, and Logistic Regression. To verify the efficiency of the resampling strategies, balanced datasets are used to improve classifier performance. In addition, 5-fold cross-validation is employed to validate the models.

Based on the prediction results, students may get a better idea of how well or poorly they will perform in a course and can then take steps to improve their performance. Increasing student retention is a long-term goal of any educational institution around the world. Increased retention has numerous benefits, including improved college reputation, ranking, and job prospects for alumni.
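
The 5-fold cross-validation used to validate the models can be sketched as follows. This is a minimal illustration in plain Python, not the study's implementation: the function name `k_fold_indices`, the shuffling seed, and the use of simple (non-stratified) folds are assumptions. Only the fold count (5) and the number of instances (1044) come from the paper.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper: the paper only states that 5-fold
    cross-validation is used, not how the folds are built.
    """
    indices = list(range(n_samples))
    # Shuffle once so that each fold is a random sample of the data.
    random.Random(seed).shuffle(indices)
    # Spread the samples as evenly as possible over the k folds:
    # the first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 1044 is the number of instances reported for the dataset.
folds = list(k_fold_indices(1044, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={len(train_idx)} test={len(test_idx)}")
```

Each instance is used for testing exactly once and for training in the other four folds. With imbalanced classes, a stratified split that preserves the class ratio in every fold may be preferable, but the paper does not say whether one was used.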
Fig. 1: Architecture diagram

A. Data Collection
The student performance data set used in this study was collected from Kaggle. It contains 1044 instances, each with 33 attributes covering student grades and demographic, social, and school-related characteristics. The attributes are Sex, Age, School, Address, Parent status, Mother's education, Mother's job, Father's education, Father's job, Student guardian, Family size, Quality of family relationship, Reason to choose school, Travel time, Study time, Failures, School support, Family support, Activities, Tuition details, Internet, Nursery, Higher education, Relationship status, Free time, Going out with friends, Weekend alcohol consumption, Workday alcohol consumption, Health status, Absences, First period grade, Second period grade, and Final grade. The dataset was downloaded from an open-source Kaggle kernel database. The attributes are treated as independent of each other, except for the student's final grade, which depends on all the other attributes.

B. Data Preprocessing
Data preparation is one of the most important phases in machine learning. This process converts the raw data into a format that the learning algorithms can use. Real-world datasets contain several flaws, and this phase eliminates such errors, making the datasets easier to manage. Because the datasets used in this study contain no missing data, handling missing data as a stage of data preparation is not necessary.

C. Feature Extraction
Feature extraction, as the term is used here, refers to feature scaling (data normalisation): a technique for normalising the range of a dataset's independent variables or features. Many machine learning models rely on the Euclidean distance between two data points, so they may not perform well when features are on very different scales. Standardization, Mean Normalization, Min-Max Scaling, and Unit Vector scaling are four prominent techniques for this step. The values in the student performance dataset used in this study span a wide range. To rescale the features, this research used Standardization.

D. Random Forest
Random Forest is a common supervised machine learning algorithm. It can be used for both classification and regression problems. It is based on the concept of ensemble learning, which combines different classifiers to solve a complex problem and to improve the performance of the model. Random Forest builds numerous decision trees on distinct subsets of the dataset and combines their outputs to improve predictive accuracy. Instead of relying on a single decision tree, random forest gathers the predictions from all trees and predicts the final result based on the majority of votes.

E. Logistic Regression
In its most basic form, logistic regression is a statistical model that represents a binary dependent variable with a logistic function, although many more complex extensions exist. In regression analysis, logistic regression estimates the parameters of a logistic model. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. The output of a logistic regression can be difficult to interpret; the Intellectus Statistics tool can be used to conduct the analysis and report the results in plain English.

F. Artificial Neural Network
Artificial neural networks (ANNs) are comparable to biological neural networks in form, function, and data processing, and they are a relatively good methodology for solving classification and prediction problems. An ANN is a set of mathematical models that imitate a number of properties of biological neural systems and resemble adaptive human learning.

ANNs are made up of a large number of linked neurons. The connections carry weight coefficients that are similar in function to synapses. The neurons are organised in three kinds of layers: an input layer, one or more hidden layers, and an output layer.

ANNs handle data in a way similar to biological neural networks, with the added capability of remembering, learning, and correcting errors at a high rate, which allows them to be used for complicated tasks such as classification and prediction. ANNs have been employed effectively to model complicated real-world situations in a variety of areas.
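
As a minimal sketch of three building blocks described in the subsections on Feature Extraction, Random Forest, and Logistic Regression, the Python code below shows z-score standardization, a majority vote over tree predictions, and the logistic function. All function names, the sample values, and the weights are hypothetical; they are not taken from the study's implementation, which the paper does not list.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance (z-score)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties are resolved in favour of the label that appears first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_pass_probability(features, weights, bias):
    """Logistic regression forward pass: P(pass) = logistic(w . x + b)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(score)


# Hypothetical values, chosen only to exercise the functions.
absences = [0, 2, 4, 6, 8]
scaled = standardize(absences)

votes = ["pass", "fail", "pass"]  # predictions of three imaginary trees
label = majority_vote(votes)

probability = predict_pass_probability([1.0, -0.5], weights=[0.8, 0.4], bias=0.0)

print(scaled)
print(label)
print(round(probability, 3))
```

After standardization every feature has mean 0 and standard deviation 1, so no single wide-ranging attribute (for example absences) dominates distance-based or gradient-based learning.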
H. Flowchart
Figure 2 shows the flowchart of our project, which has two phases: a training phase and a prediction phase. In the training phase, the data is collected, preprocessed, and passed through feature selection to obtain a clean dataset. The normalised data is then used to train three different algorithms. In the prediction phase, the relevant details of a student are collected as input, and the same feature extraction is applied to them to make them compatible with the ML algorithms. The output is a text file that contains the student's prediction result as a percentage and a statement of whether he/she has passed.
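
The output step of the prediction phase can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names, the report layout, the student identifier, and the pass mark of 50% are all hypothetical, since the paper states only that the text file holds a percentage and a pass/fail description.

```python
from pathlib import Path


def format_prediction(student_id, predicted_percentage, pass_mark=50.0):
    """Build the text of a prediction report.

    pass_mark is an assumed threshold; the paper does not state one.
    """
    outcome = "Pass" if predicted_percentage >= pass_mark else "Fail"
    return (
        f"Student: {student_id}\n"
        f"Predicted result: {predicted_percentage:.1f}%\n"
        f"Outcome: {outcome}\n"
    )


def write_prediction(path, student_id, predicted_percentage, pass_mark=50.0):
    """Write the report to a text file and return its contents."""
    report = format_prediction(student_id, predicted_percentage, pass_mark)
    Path(path).write_text(report, encoding="utf-8")
    return report


# Hypothetical student identifier and model output.
report = format_prediction("S001", 72.456)
print(report)
```

Calling `write_prediction("prediction.txt", "S001", 72.456)` would store the same report in a file; the file name here is only an example.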
V. RESULT
REFERENCES