
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2997311, IEEE Access


COVID-19 Future Forecasting Using Supervised Machine Learning Models
FURQAN RUSTAM1, AIJAZ AHMAD RESHI2, ARIF MEHMOOD3, SALEEM ULLAH1, BYUNG-WON ON4, WAQAR ASLAM3 AND GYU SANG CHOI5
1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, 64200, Pakistan
2 The College of Computer Science and Engineering, Department of Computer Science, Taibah University, Al Madinah Al Munawarah, Saudi Arabia
3 Department of Computer Science & IT, The Islamia University of Bahawalpur, Punjab 63100, Pakistan
4 Department of Software Convergence Engineering, Kunsan National University, Gunsan 54150, Korea
5 Department of Information & Communication Engineering, Yeungnam University, Gyeongbuk 38541, Korea
Corresponding authors: Byung-Won On (email: bwon@kunsan.ac.kr) and Arif Mehmood (email: arifnhmp@gmail.com)
This research was partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1F1A1060752).

ABSTRACT Machine learning (ML) based forecasting mechanisms have proved their significance in anticipating perioperative outcomes to improve decision making on the future course of action. ML models have long been used in many application domains that need the identification and prioritization of adverse factors for a threat. Several prediction methods are in popular use for forecasting problems. This study demonstrates the capability of ML models to forecast the number of upcoming patients affected by COVID-19, which is presently considered a potential threat to mankind. In particular, four standard forecasting models, namely linear regression (LR), least absolute shrinkage and selection operator (LASSO), support vector machine (SVM), and exponential smoothing (ES), have been used in this study to forecast the threatening factors of COVID-19. Each model makes three types of predictions: the number of newly infected cases, the number of deaths, and the number of recoveries in the next 10 days. The results of the study show that these methods are a promising mechanism for the current scenario of the COVID-19 pandemic. ES performs best among all the models used, followed by LR and LASSO, which perform well in forecasting the new confirmed cases, death rate, and recovery rate, while SVM performs poorly in all the prediction scenarios given the available dataset.

INDEX TERMS COVID-19, exponential smoothing method, future forecasting, Adjusted R2 score,
supervised machine learning

I. INTRODUCTION
Machine learning (ML) has proved itself as a prominent field of study over the last decade by solving many very complex and sophisticated real-world problems. The application areas include almost all real-world domains, such as healthcare, autonomous vehicles (AV), business applications, natural language processing (NLP), intelligent robots, gaming, climate modeling, and voice and image processing. The learning of ML algorithms is typically based on a trial-and-error method, quite the opposite of conventional algorithms, which follow programming instructions based on decision statements like if-else [1]. One of the most significant areas of ML is forecasting [2]; numerous standard ML algorithms have been used in this area to guide the future course of actions needed in many application areas, including weather forecasting, disease forecasting, stock market forecasting, and disease prognosis. Various regression and neural network models have wide applicability in predicting the future condition of patients with a specific disease [3]. Many studies have applied machine learning techniques to the prediction of different diseases, such as coronary artery disease [4], cardiovascular disease [5], and breast cancer [6]. In particular, the study in [7] focuses on live forecasting of COVID-19 confirmed cases, and the study in [8] focuses on forecasting the COVID-19 outbreak and early response. These prediction systems can be very helpful in decision making to handle the present scenario and to guide early interventions to manage these diseases effectively.

This study aims to provide an early forecast model for the


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

spread of the novel coronavirus, SARS-CoV-2, which causes the disease officially named COVID-19 by the World Health Organization (WHO) [9]. COVID-19 is presently a very serious threat to human life all over the world. At the end of 2019, the virus was first identified in Wuhan, a city in China, when a large number of people developed pneumonia-like symptoms [10]. It has a diverse effect on the human body, including severe acute respiratory syndrome and multi-organ failure, which can ultimately lead to death in a very short duration [11]. Hundreds of thousands of people are affected by this pandemic throughout the world, with thousands of deaths every day. Thousands of new people are reported to be positive every day from countries across the world. The virus spreads primarily through close person-to-person physical contact, by respiratory droplets, or by touching contaminated surfaces. The most challenging aspect of its spread is that a person can carry the virus for many days without showing symptoms. Because of the way it spreads and considering its danger, almost all countries have declared either partial or strict lockdowns throughout the affected regions and cities. Medical researchers throughout the globe are currently working to discover an appropriate vaccine and medications for the disease. Since there is no approved medication so far for killing the virus, the governments of all countries are focusing on the precautions that can stop the spread. Of all the precautions, being informed about all the aspects of COVID-19 is considered extremely important. To contribute to this aspect of information, numerous researchers are studying the different dimensions of the pandemic and producing results to help humanity.

To contribute to the current human crisis, our attempt in this study is to develop a forecasting system for COVID-19. The forecasting is done for three important variables of the disease for the coming 10 days: 1) the number of new confirmed cases, 2) the number of death cases, and 3) the number of recoveries. This forecasting problem has been treated as a regression problem in this study, so the study is based on some state-of-the-art supervised ML regression models, namely linear regression (LR), least absolute shrinkage and selection operator (LASSO), support vector machine (SVM), and exponential smoothing (ES). The learning models have been trained using the COVID-19 patient stats dataset provided by Johns Hopkins. The dataset has been preprocessed and divided into two subsets: a training set (85% of records) and a testing set (15% of records). The performance evaluation has been done in terms of important measures including R-squared score ($R^2$ score), Adjusted R-squared score ($R^2_{adjusted}$), mean square error (MSE), mean absolute error (MAE), and root mean square error (RMSE).

This study has some key findings, which are listed below:
• ES performs best when the time-series dataset has very limited entries.
• Different ML algorithms seem to perform better in different class predictions.
• Most of the ML algorithms require an ample amount of data to predict the future; as the size of the dataset increases, the model performances improve.
• ML model based forecasting can be very useful for decision-makers to contain pandemics like COVID-19.

The paper is organized as follows. Section I presents the introduction, Section II contains the description of the dataset and methods used in this study, Section III presents the methodology, Section IV presents the results, and Section V summarizes the paper and presents the conclusion.

II. MATERIALS & METHODS
A. DATASET
The aim of this study is the future forecasting of COVID-19 spread, focusing on the number of new positive cases, the number of deaths, and the number of recoveries. The dataset used in the study has been obtained from the GitHub repository provided by the Center for Systems Science and Engineering, Johns Hopkins University [12]. The repository was primarily made available for the visual dashboard of 2019 Novel Coronavirus by the university and was supported by the ESRI Living Atlas Team. Dataset files are contained in the repository folder named csse_covid_19_time_series. The folder contains daily time-series summary tables, including the number of confirmed cases, deaths, and recoveries. All data are from the daily case report, and the data are updated once a day. Data samples from the files are shown in Tables 1, 2, and 3, respectively.

TABLE 1: COVID-19 patient death cases time-series worldwide
Province/State      Country/Region  Lat     Long    1/22/20  1/23/20  ...  1/27/20
Northern Territory  Australia       -12.46  130.84  0        0        ...  0
Diamond Princess    Canada          0.000   0.000   0        0        ...  1
NaN                 Algeria         28.03   1.65    0        0        ...  19

TABLE 2: COVID-19 new confirmed cases time-series worldwide
Province/State      Country/Region  Lat     Long    1/22/20  1/23/20  ...  1/27/20
NaN                 Afghan          33.00   65.00   0        0        ...  74
Victoria            Australia       -37.81  144.96  0        0        ...  411
NaN                 Algeria         28.03   1.65    0        0        ...  264

TABLE 3: COVID-19 recovery cases time-series worldwide
Province/State      Country/Region  Lat     Long    1/22/20  1/23/20  ...  1/27/20
Colombia            Canada          49.28   -123.1  0        0        ...  4
Victoria            Australia       -37.81  144.96  0        0        ...  70
NaN                 Algeria         28.03   1.65    0        0        ...  65

B. SUPERVISED MACHINE LEARNING MODELS
A supervised learning model is built to make a prediction when it is provided with an unknown input instance. Thus

in this learning technique, the learning algorithm takes a dataset of input instances along with their corresponding target values to train the regression model. The trained model then generates a prediction for unseen input data or a test dataset [13]. This learning method may use regression techniques and classification algorithms to develop predictive models.

Four regression models have been used in this study of COVID-19 future forecasting:
• Linear Regression
• LASSO Regression
• Support Vector Machine
• Exponential Smoothing

1) Linear Regression
In regression modeling, a target value is predicted from the independent features [14]. This method can thus be used to find the relationship between independent and dependent variables and also for forecasting. Linear regression, a type of regression modeling, is the most widely used statistical technique for predictive analysis in machine learning. Each observation in linear regression depends on two values: one is the dependent variable and the other is the independent variable. Linear regression determines a linear relationship between these dependent and independent variables. Two factors (x, y) are involved in linear regression analysis. The equation below shows how y is related to x, which is known as regression:

$$ y = \beta_0 + \beta_1 x + \varepsilon \tag{1} $$

or equivalently

$$ E(y) = \beta_0 + \beta_1 x \tag{2} $$

Here, $\varepsilon$ is the error term of linear regression. The error term accounts for the variability in y that the linear relationship with x does not explain, $\beta_0$ represents the y-intercept, and $\beta_1$ represents the slope.

To put the concept of linear regression in the machine learning context: to train the model, x represents the input training dataset and y represents the labels present in the input dataset. The goal of the machine learning algorithm is then to find the best values for $\beta_0$ (intercept) and $\beta_1$ (coefficient) to get the best-fit regression line. The best fit implies that the difference between the actual values and the predicted values should be minimal, so this minimization problem can be represented as:

$$ \text{minimize} \; \frac{1}{n} \sum_{i=1}^{n} (pred_i - y_i)^2 \tag{3} $$

$$ g = \frac{1}{n} \sum_{i=1}^{n} (pred_i - y_i)^2 \tag{4} $$

Here, g is called the cost function, which is the mean of the squared differences between the predicted value of y ($pred_i$) and the actual y ($y_i$), and n is the total number of data points.

2) LASSO
LASSO is a regression model that belongs to the family of linear regression techniques and uses shrinkage [15]. Shrinkage in this context refers to the shrinking of extreme values of a data sample towards central values. The shrinkage process thus makes LASSO more stable and also reduces the error [16]. LASSO is considered a more suitable model for multicollinearity scenarios. The model performs L1 regularization, and the penalty added in this case is equal to the sum of the magnitudes (absolute values) of the coefficients. LASSO therefore makes the regression simpler in terms of the number of features it uses. It uses a regularization method that automatically penalizes extra features. That is, the coefficients of features that do not help the regression results enough can be set to a very small value, potentially zero.

An ordinary multivariate regression uses all the features available to it and assigns each one a regression coefficient. In contrast, LASSO regression attempts to add them one at a time, and if a new feature does not improve the fit enough to outweigh the penalty term incurred by including it, the feature is not added, meaning its coefficient stays at zero. Thus the power of regularization by applying the penalty term to extra features is that it can automatically do the feature selection for us. The models are made sparse, with few coefficients, since the process eliminates the coefficients whose values are equal to zero. That means LASSO regression works on an objective to minimize the following:

$$ \sum_{i=1}^{n} \Big( y_i - \sum_{j} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{5} $$

It sets the coefficients, which can be interpreted as min(sum of squared residuals + λ|slope|), where λ|slope| is the penalty term.

3) Support Vector Machine
A support vector machine (SVM) is a type of supervised ML algorithm used for both regression and classification [17], [18]. SVM regression, being a non-parametric technique, depends on a set of mathematical functions. This set of functions, called the kernel, transforms the data inputs into the desired form. SVM solves regression problems using a linear function, so when dealing with problems of non-linear regression, it maps the input vector (x) to an n-dimensional space called a feature space (z). This mapping is done by non-linear mapping techniques, after which linear regression is applied in that space. To put the concept in the ML context, consider a multivariate training dataset ($x_n$) with N observations and $y_n$ as the set of observed responses. The linear function can be depicted as:

$$ f(x) = x'\beta + b \tag{6} $$

The objective is to make it as flat as possible, that is, to find f(x) with the minimal norm value ($\beta'\beta$). So the problem is posed as minimizing the function:

$$ J(\beta) = \frac{1}{2} \beta'\beta \tag{7} $$

subject to the condition that no residual is larger than $\varepsilon$, as in the following equation:

$$ \forall n : |y_n - (x_n'\beta + b)| \le \varepsilon \tag{8} $$

4) Exponential Smoothing
In the exponential smoothing family of methods, forecasting is done based on data from previous periods. The influence of past observations decays exponentially as they become older. Thus the weights assigned to successive lag values decline geometrically. ES is a very simple yet powerful time-series forecasting method, specifically for univariate data [7], [19]. The forecast for the current time ($F_t$) in ES is given by:

$$ F_t = \alpha A_{t-1} + (1 - \alpha) F_{t-1} \tag{9} $$

Here, $\alpha$ is the smoothing constant with $0 \le \alpha \le 1$, $A_{t-1}$ is the actual value of the previous period in the time series, and $F_{t-1}$ is the forecast value of the previous period.

C. EVALUATION PARAMETERS
In this study, we evaluate the performance of each of the learning models in terms of R-squared ($R^2$) score, Adjusted R-squared ($R^2_{adjusted}$), mean square error (MSE), mean absolute error (MAE), and root mean square error (RMSE).

1) R-squared Score
The R-squared ($R^2$) score is a statistical measure used to evaluate the performance of regression models [20], [21]. The statistic shows the percentage of the dependent variable's variance that the independent variables collectively explain. It measures the strength of the relationship between the dependent variable and the regression model on a convenient 0 to 100% scale. After training the regression model, we can check the goodness-of-fit of the trained model by using the $R^2$ score. The $R^2$ score, also referred to as the coefficient of determination, measures the scatter of the data points around the regression line. On this scale the score lies between 0 and 100%. A score of 0% implies that the model explains none of the variability of the response variable around its mean, and 100% implies that the model explains all of that variability. A high $R^2$ score shows the goodness of the trained model. For a linear model, $R^2$ gives the percentage of variation in the dependent variable that the model explains. It can be found as:

$$ R^2 = \frac{\text{Variance explained by model}}{\text{Total variance}} \tag{10} $$

2) Adjusted R-squared Score
The Adjusted R-squared ($R^2_{adjusted}$) is a modified form of $R^2$, which, like $R^2$, shows how well the data points fit the curve. The primary difference between $R^2$ and $R^2_{adjusted}$ is that the latter adjusts for the number of features in a prediction model. In the case of $R^2_{adjusted}$, adding new features can increase its value if the newly added features are useful to the prediction model. However, if the newly added features are useless, its value will decrease. $R^2_{adjusted}$ can be defined as:

$$ R^2_{adjusted} = 1 - (1 - R^2) \frac{n - 1}{n - (k + 1)} \tag{11} $$

Here, n is the sample size and k is the number of independent variables in the regression equation.

3) Mean Absolute Error (MAE)
The mean absolute error is the average magnitude of the errors in a set of model predictions [22], [23]. It is the average, over the test data, of the absolute differences between the model predictions and the actual data, where all individual differences have equal weight. The metric ranges from 0 to infinity, and lower scores indicate better learning models, which is why such metrics are also called negatively-oriented scores [24].

$$ MAE = \frac{1}{n} \sum_{j=1}^{n} |y_j - \hat{y}_j| \tag{12} $$

4) Mean Square Error (MSE)
Mean square error is another way to measure the performance of regression models [22]. MSE takes the distances of the data points from the regression line and squares them. Squaring is necessary because it removes the negative sign from the value and gives more weight to larger differences. The smaller the mean squared error, the closer the model is to the line of best fit. MSE can be calculated as:

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{13} $$

5) Root Mean Square Error (RMSE)
Root mean square error can be defined as the standard deviation of the prediction errors. Prediction errors, also known as residuals, are the distances between the best-fit line and the actual data points. RMSE is thus a measure of how concentrated the actual data points are around the best-fit line. It is the error rate given by the square root of MSE, as follows:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{14} $$

III. METHODOLOGY
This study concerns predictions for the novel coronavirus disease, also known as COVID-19. COVID-19 has proved to be a present potential threat to human life. It has caused tens of thousands of deaths, and the death rate is increasing day by day throughout the globe. To contribute to the control of this pandemic situation, this study attempts to perform future forecasting on the death

rate, the number of daily confirmed infected cases, and the number of recovery cases in the upcoming 10 days. The forecasting has been done using four ML approaches that are appropriate to this context. The dataset used in the study contains daily time-series summary tables, including the number of confirmed cases, deaths, and recoveries for each day since the pandemic started. Initially, the dataset has been preprocessed for this study to find the global statistics of the daily number of deaths, confirmed cases, and recoveries. The resulting time series has been extracted from the reported data shown in Table 4; samples of the resulting dataset are shown in Tables 5, 6, and 7, respectively.
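
The aggregation into global daily statistics described above might look like the following minimal sketch. It is an illustration rather than the authors' code: it assumes the wide layout of Table 4 (four metadata columns followed by one column per date), and both the inline sample and the function name `global_daily_totals` are hypothetical.

```python
import csv
import io

# Hypothetical three-row sample in the wide layout of Table 4; the real files
# hold one row per province/country and one column per reporting date.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Victoria,Australia,-37.81,144.96,0,1,3
,Canada,0.0,0.0,0,0,1
,Algeria,28.03,1.65,0,2,4
"""

def global_daily_totals(csv_text, n_meta_cols=4):
    """Sum every date column over all rows to get one worldwide series."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    dates = header[n_meta_cols:]
    totals = [0] * len(dates)
    for row in body:
        # Skip the metadata columns and accumulate the per-date counts.
        for k, cell in enumerate(row[n_meta_cols:]):
            totals[k] += int(cell)
    return dates, totals

dates, totals = global_daily_totals(SAMPLE)
print(list(zip(dates, totals)))
```

Applied separately to the deaths, confirmed, and recovered files, a step like this would yield one day-indexed series per category, in the shape sampled in Tables 5, 6, and 7.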

TABLE 4: Sample of data from worldwide cases time-series
Category       Province/State  Country/Region  Lat     Long    1/22/20  ...  1/27/20
Death          Victoria        Australia       -12.4   130.84  0        ...  0
               NaN             Canada          0.000   0.000   0        ...  1
               NaN             Algeria         28.03   1.65    0        ...  19
Recovery       Colombia        Canada          49.28   -123.1  0        ...  4
               Victoria        Australia       -37.8   144.96  0        ...  70
               NaN             Algeria         28.03   1.65    0        ...  65
New Confirmed  NaN             Afghan          33.00   65.00   0        ...  74
               Victoria        Australia       -37.8   144.96  0        ...  411
               NaN             Algeria         28.03   1.65    0        ...  264

TABLE 5: Day wise total death cases sample data
Day 1 deaths  Day 2 deaths  ...  Day 66 deaths
0             4             ...  20

TABLE 6: Day wise total recoveries rate sample data
Day 1 recoveries  Day 2 recoveries  ...  Day 66 recoveries
0                 6                 ...  139

TABLE 7: Day wise total new confirmed cases sample data
Day 1 new cases  Day 2 new cases  ...  Day 66 new cases
0                21               ...  749

After the initial data preprocessing step, the dataset has been divided into two subsets: a training set (56 days) to train the models and a testing set (10 days). The learning models SVM, LR, LASSO, and ES have been used in this study. These models have been trained on the days and on the patterns of newly confirmed cases, recoveries, and deaths. The learning models have then been evaluated on important metrics, namely $R^2$ score, $R^2_{adjusted}$ score, MSE, RMSE, and MAE, and the scores are reported in the results. The proposed approach used in the study is shown as a block diagram in Figure 1.

FIGURE 1: Proposed Workflow

IV. RESULTS & DISCUSSION
This study attempts to develop a system for the future forecasting of the number of cases affected by COVID-19 using machine learning methods. The dataset used for the study contains information about the daily reports of the number of newly infected cases, the number of recoveries, and the number of deaths due to COVID-19 worldwide. The death rate and confirmed cases are increasing day by day, which is an alarming situation for the world. The number of people who can be affected by the COVID-19 pandemic in different countries of the world is not well known. This study is an attempt to forecast the number of people that can be affected in terms of new infected cases and deaths, including the number of expected recoveries, for the upcoming 10 days. Four machine learning models, LR, LASSO, SVM, and ES, have been used to predict the number of newly infected cases, the number of deaths, and the number of recoveries.

A. DEATH RATE FUTURE FORECASTING
The study performs predictions on the death rate. According to the results, ES performs best among all the models ($R^2$ = 0.98), followed by LR (0.96) and then LASSO (0.85). In comparison, SVM performs worst in this situation (0.53). The results are shown in Table 8.

TABLE 8: Models performance on future forecasting for death rate
Model  R2 Score  R2 Adjusted  MSE          MAE      RMSE
LR     0.96      0.95         840240.11    723.11   916.64
LASSO  0.85      0.81         3244066.79   1430.29  1801.12
SVM    0.53      0.39         16016210.98  3129.74  4002.02
ES     0.98      0.97         662228.72    406.08   813.77

Figures 2, 3, 4, and 5 show the performance of the LR, LASSO, SVM, and ES models, respectively, in the form of graphs. The graphs in all figures predict that the death rate will increase in the upcoming days, which is a very alarming sign. The current mortality rate plotted in the graph in Figure 14 is consistent with the models' predictions.
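
The five scores reported per model in the results tables follow Eqs. (10) to (14). The sketch below shows one way to compute them from a test window in plain Python. It is not the authors' implementation: the function name `regression_metrics`, the sample series, and the value k = 1 are hypothetical, and $R^2$ is computed as 1 − SS_res/SS_tot, a common formulation assumed here in place of the variance ratio of Eq. (10).

```python
import math

def regression_metrics(y_true, y_pred, k):
    """Return R2, adjusted R2, MSE, MAE and RMSE for paired observations.

    k is the number of independent variables, as in Eq. (11).
    """
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n                  # Eq. (13)
    mae = sum(abs(e) for e in errors) / n                 # Eq. (12)
    rmse = math.sqrt(mse)                                 # Eq. (14)
    mean_y = sum(y_true) / n
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)     # total variation
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    # Assumption: R2 = 1 - SS_res / SS_tot (can be negative for poor forecasts).
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1))   # Eq. (11)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "mae": mae, "rmse": rmse}

# Hypothetical 10-day test window (not data from the paper).
actual = [10, 12, 15, 19, 24, 30, 37, 45, 54, 64]
forecast = [11, 12, 14, 20, 23, 31, 36, 46, 52, 66]
m = regression_metrics(actual, forecast, k=1)
print({name: round(value, 3) for name, value in m.items()})
```

Because Eq. (11) only rescales 1 − $R^2$, the adjusted score is always at or below $R^2$ when k ≥ 1, and the gap widens when the test window is short, as with the 10-day windows used here.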

FIGURE 2: Death prediction by LR for the upcoming 10 days

FIGURE 3: Death prediction by LASSO for the upcoming 10 days

FIGURE 4: Death prediction by SVM for the upcoming 10 days

FIGURE 5: Death prediction by ES for the upcoming 10 days

B. NEW INFECTED CONFIRMED CASES FUTURE FORECASTING
The new confirmed cases of COVID-19 increase day by day. Table 9 shows the forecasting results of the models used in this study. ES and LASSO lead the table in terms of performance, LR also performs well, while SVM performs very poorly on all the evaluation metrics. The graphs in Figures 6, 7, 8, and 9 show the predictions of the learning models.

TABLE 9: Models performance on future forecasting for new infected confirmed cases
Model  R2 Score  R2 Adjusted  MSE            MAE       RMSE
LR     0.83      0.79         1472986504.96  30279.55  38390.51
LASSO  0.98      0.97         234489560.99   11693.97  15322.11
SVM    0.59      0.47         5760890969.30  60177.90  75911.28
ES     0.98      0.97         283201302.2    8867.43   16828.58

FIGURE 6: New infected confirmed cases prediction by LR for the upcoming 10 days

FIGURE 7: New infected confirmed cases prediction by LASSO for the upcoming 10 days

FIGURE 8: New infected confirmed cases prediction by SVM for the upcoming 10 days
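
To make the difference between the LR and LASSO fits concrete, the following sketch fits a straight line to a day-indexed series, first by ordinary least squares (Eqs. (1) to (4)) and then with the L1 penalty of Eq. (5) restricted to a single feature. It is a simplified illustration under stated assumptions, not the configuration used in the study: the function name `fit_line`, the sample counts, the 6/2 split (the study uses 56/10 days), and the value of λ are all hypothetical, and the intercept is left unpenalized.

```python
def fit_line(x, y, lam=0.0):
    """Fit y ~ b0 + b1*x by minimizing sum((y - b0 - b1*x)^2) + lam*|b1|.

    lam = 0 gives ordinary least squares; lam > 0 gives the one-feature
    LASSO solution, obtained in closed form by soft-thresholding.
    Assumption of this sketch: the intercept b0 is not penalized.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # Soft-threshold: shrink |sxy| by lam/2 and clip at zero, so a large
    # enough penalty sets the slope exactly to zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    b1 = (shrunk if sxy >= 0 else -shrunk) / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical cumulative counts for days 1..8 (not data from the paper).
days = list(range(1, 9))
counts = [3, 5, 7, 9, 11, 13, 15, 17]

# Chronological split: earlier days train the model, later days test it.
train_x, test_x = days[:6], days[6:]
train_y, test_y = counts[:6], counts[6:]

b0_ols, b1_ols = fit_line(train_x, train_y)           # LR
b0_l1, b1_l1 = fit_line(train_x, train_y, lam=7.0)    # LASSO, hypothetical lambda

forecast_ols = [b0_ols + b1_ols * d for d in test_x]
forecast_l1 = [b0_l1 + b1_l1 * d for d in test_x]
print(forecast_ols, forecast_l1)
```

The penalized slope is never larger in magnitude than the least-squares slope, which illustrates the shrinkage described for LASSO; with several features the same thresholding is what drives some coefficients to exactly zero.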

FIGURE 9: New infected confirmed cases prediction by ES for the upcoming 10 days

C. RECOVERY RATE FUTURE FORECASTING
In recovery rate future forecasting, ES again performs best among all the models. All other models perform poorly; because of the nature of the available time-series data, the order of performance from best to worst is ES, followed by LR, LASSO, and SVM. The prediction trends for the coming days are shown in Figures 10, 11, 12, and 13. The performance results of the learning models are shown in Table 10 below:

TABLE 10: Models performance on future forecasting for recovery rate
Model  R2 Score  R2 Adjusted  MSE             MAE        RMSE
LR     0.39      0.21         480922814.51    17016.08   21929.95
LASSO  0.29      0.08         1462144344.82   30705.27   38237.99
SVM    0.24      0.02         13121148615.72  106739.82  114547.58
ES     0.99      0.99         5970634.07      1827.85    2443.48

FIGURE 10: Recovery rate prediction by LR for the upcoming 10 days

FIGURE 11: Recovery rate prediction by LASSO for the upcoming 10 days

FIGURE 12: Recovery rate prediction by SVM for the upcoming 10 days

FIGURE 13: Recovery rate prediction by ES for the upcoming 10 days

However, comparing the current recovery statistics (Figure 19) with our models' predictions, the ES prediction follows trends that are very close to the actual situation.

Besides, some more analysis has been performed after 5 days of experiments on the updated dataset, and some important statistics have been found, as shown in Figures 14, 15, 16, and 17. Figures 14 and 15 show that our model predictions are quite promising: the models predict that the death rate will increase in the upcoming days, and the graph of the mortality rate shows the same pattern; in the recovery scenario, the models predict that the recovery rate will slow down, and the recovery graph in Figure 15 follows the same pattern, which supports the model predictions.

FIGURE 14: Mortality rate after 5 days of this study prediction
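
Since ES is the strongest model in every scenario above, it may help to see the recurrence of Eq. (9) written out. The sketch below implements only that one-step-ahead recurrence; how the forecasts were extended over the 10-day horizon is not specified in this section, so nothing beyond Eq. (9) is assumed. The function name `exponential_smoothing`, the seeding of the first forecast with the first observation, the sample series, and α = 0.5 are all hypothetical choices for illustration.

```python
def exponential_smoothing(actuals, alpha, initial_forecast=None):
    """One-step-ahead forecasts F_t = alpha * A_{t-1} + (1 - alpha) * F_{t-1}.

    Returns len(actuals) + 1 values: a forecast for each observed period
    plus one for the next, not yet observed period.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumption: seed the recurrence with the first observation.
    forecast = actuals[0] if initial_forecast is None else initial_forecast
    forecasts = [forecast]
    for actual in actuals:
        # Blend the latest actual value with the previous forecast (Eq. (9)).
        forecast = alpha * actual + (1.0 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

# Hypothetical daily recovery counts (not data from the paper).
recoveries = [100, 120, 150, 160]
print(exponential_smoothing(recoveries, alpha=0.5))
```

Unrolling the loop shows the geometric weighting described for ES: the most recent observation receives weight α, the one before it α(1 − α), then α(1 − α)², and so on. With α = 1 the forecast simply repeats the last observed value.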


FIGURE 15: Recovery rate after 5 days of this study prediction

FIGURE 16: Comparison between death rate, recovery rate and confirmed case rate after 5 days of this study prediction

FIGURE 17: Ratio between recovery rate and death rate after 5 days of this study prediction

D. MODEL PERFORMANCES WITH 10-15 DAYS PREDICTION INTERVALS
As shown in the previous sections, ES performed best in all three cases: death rate forecasting, new confirmed cases forecasting, and recovery rate forecasting. Considering the best performance given by the ES model in all three forecasting cases among all four models, the model has been used for further analysis with interval prediction [7]. Figure 18 presents the model performance on the death rate, recovery rate, and new confirmed cases with a 15-day interval period.

First, all the models have been trained on the data from 22 Jan 2020 to 16 Feb 2020, and predictions were made for the upcoming 10 days from 16 Feb 2020. The data available in this dataset covered only 26 days. Because of this very small dataset, three models, LR, LASSO, and SVM, could not perform well in prediction, as reported in Table 11. However, ES performs better even on the limited number of records in the dataset, as shown in the graphs of Figure 18.

In the second model training interval, the models were trained on the data from 22 Jan 2020 to 02 Mar 2020; data of 15 more days were added to the training set to predict the outcome of the upcoming 10 days from 02 Mar 2020. The dataset now contained 41 days of data, yet LR, LASSO, and SVM still could not perform well in any prediction class. However, ES also performed very well in this phase, as can be seen in the graphs of Figure 18.

In the third interval, the next 15 days were added to the dataset, bringing the size of the training dataset to 56 days. As can be seen in the results, LR improved significantly and LASSO also showed some improvement. ES, while still performing well in this interval, shows some deviation from the actual data series, as shown in the graphs of Figure 18, because of a sudden rise in all three cases in this period.

In the fourth interval, data of 10 more days have been added, increasing the size of the training set to 66 days. In this interval all the models improve very significantly, bringing the overall results very near to the actual situation. However, ES outperforms all the models in the prediction of all three cases.

In general, ES performed best, followed by LR, then LASSO, and then SVM. The prediction results have been compared with the actual data reports of these particular day intervals. The predictions provided by these models have been found to be very close to the actual reports. The interval details have been compiled and given in Table 11.

TABLE 11: Models performance on future forecasting for recovery rate
Interval  Dataset Size      Dates (From 22   LASSO        LR           SVM            ES
          (Number of Days)  Jan 2020) To     Performance  Performance  Performance    Performance
1.        26                16 Feb 2020      Very poor    Very poor    Very poor      Best
2.        41                2 Mar 2020       Very poor    Very poor    Very poor      Best
3.        56                17 Mar 2020      Poor         Good         Very poor      Best
4.        66                27 Mar 2020      Better       Best         Well improved  Best

To continue and further extend the scope of this study in forecasting, the same methodology has been applied
to further forecast the number of confirmed cases, deaths, and recoveries up to 6 Apr 2020. Figure 19 presents the plots of confirmed cases, deaths, and recoveries in the first four panes, followed in the fifth pane by the plot of the actual situation gathered from the actual data reports of the sampling period of the study. The results in the graphs indicate that the ML models used in this study befit the forecasting task, paving the way for the usability of the study and for future research of a similar nature.

FIGURE 18: ES performances on death rate, recovery rate and new confirmed cases with 10-15 days intervals

FIGURE 19: All models predictions from 1/22/2020 to 4/6/2020 and real situation from 1/22/2020 to 4/6/2020

E. PREDICTION INTERVALS OF LR FOR FORECASTING UNCERTAINTY
A prediction interval quantifies the uncertainty of a prediction. It provides probabilistic upper and lower bounds on the estimate of an outcome variable [25]. To evaluate this uncertainty we compute prediction intervals for LR because, among the three regression models (LR, LASSO, and SVM), LR in general performs better in all three cases (death rate forecasting, new confirmed cases forecasting, and recovery rate forecasting). The results are given in Table 12, which shows the prediction intervals along with the ranges and true values. The graphs in Figures 20, 21, and 22 show the prediction with interval, actual value, and error bar for newly confirmed cases, death rate, and recovery rate, respectively.

TABLE 12: Prediction intervals using LR in all three cases (death rate, new confirmed cases, recovery rate)
Cases | Significance Level | Prediction Interval | Range | True value
Recovery | 80% | 14403.95 | -14375.95 to 14431.95 | -23474.35
Recovery | 90% | 18511.33 | -18483.33 to 18539.33 | -23474.35
Recovery | 95% | 22056.05 | -22028.05 to 22084.05 | -23474.35
New Confirmed | 80% | 94320.84 | -93765.84 to 94875.84 | -64476.148
New Confirmed | 90% | 121217.02 | -120662.02 to 121772.02 | -64476.14
New Confirmed | 95% | 144428.79 | -143873.79 to 144983.79 | -64476.14
Death | 80% | 4719.35 | -4702.35 to 4736.35 | -3488.37
Death | 90% | 6065.10 | -6048.10 to 6082.10 | -3488.37
Death | 95% | 7226.50 | -7209.50 to 7243.50 | -3488.37

FIGURE 20: Prediction intervals using LR for new confirmed cases forecasting
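As an illustration of how intervals of the kind reported in Table 12 can be obtained, the following minimal sketch fits a straight line by least squares and reports the range y_hat ± z·s for the 80%, 90%, and 95% levels. It assumes roughly Gaussian residuals, uses the residual standard deviation s as the spread, and takes z = 1.28, 1.645, and 1.96. These choices, the helper names (fit_line, residual_std, prediction_range), and the synthetic data are assumptions made for illustration; they are not the study's code or dataset.

```python
import math
import random

# Two-sided standard-normal quantiles (assumes roughly Gaussian forecast errors).
Z = {0.80: 1.28, 0.90: 1.645, 0.95: 1.96}


def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def residual_std(x, y, b0, b1):
    """Residual standard deviation with n - 2 degrees of freedom."""
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


def prediction_range(y_hat, spread, level):
    """Return (half_width, lower, upper) for one point forecast."""
    half_width = Z[level] * spread
    return half_width, y_hat - half_width, y_hat + half_width


if __name__ == "__main__":
    # Hypothetical daily counts for 30 days; NOT the study's data.
    rng = random.Random(0)
    days = [float(d) for d in range(30)]
    counts = [50.0 * d + rng.gauss(0.0, 40.0) for d in days]

    b0, b1 = fit_line(days, counts)
    spread = residual_std(days, counts, b0, b1)
    y_hat = b0 + b1 * 35.0  # point forecast for day 35 (extrapolated)

    for level in (0.80, 0.90, 0.95):
        half, lower, upper = prediction_range(y_hat, spread, level)
        print(f"{int(level * 100)}%: +/-{half:.2f} -> [{lower:.2f}, {upper:.2f}]")
```

Under this reading, the half-widths in Table 12 are consistent with one spread per case scaled by these z values, and each reported range is centred on its point forecast. For the Recovery rows, for example, prediction_range(28.0, 22056.05 / 1.96, 0.80) gives a half-width of about 14403.95 and a range of about -14375.95 to 14431.95.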

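The interval-wise evaluation of Section D, in which the training window grows to 26, 41, 56, and 66 days and the following 10 days are forecast each time, can be sketched as follows. The sketch assumes Holt's linear-trend form of exponential smoothing with illustrative smoothing weights and uses the mean absolute error as the score; the text above does not state which ES variant, parameters, or per-interval metric were used, so these are assumptions, as are the function names and the synthetic series.

```python
import math


def holt_forecast(series, horizon, alpha=0.8, beta=0.2):
    """Holt's linear-trend exponential smoothing; returns `horizon` forecasts.

    alpha and beta are illustrative smoothing weights, not tuned values.
    """
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        level = alpha * value + (1.0 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1.0 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


def expanding_window_mae(series, train_sizes, horizon=10):
    """Train on the first n days, forecast the next `horizon`, report the MAE.

    Each n must leave at least one later observation to score against.
    """
    scores = {}
    for n in train_sizes:
        train, actual = series[:n], series[n:n + horizon]
        forecast = holt_forecast(train, len(actual))
        scores[n] = sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)
    return scores


if __name__ == "__main__":
    # Hypothetical cumulative case curve (76 days); NOT the study's dataset.
    cases = [100.0 * math.exp(0.08 * day) for day in range(76)]
    for n, mae in expanding_window_mae(cases, [26, 41, 56, 66]).items():
        print(f"train days = {n:2d}  MAE over next 10 days = {mae:.1f}")
```

Growing the training window while keeping the forecast horizon fixed mirrors the four intervals of Table 11 and makes it easy to see how each model's error changes as more days become available.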

FIGURE 21: Prediction intervals using LR for death rate forecasting

FIGURE 22: Prediction intervals using LR for recovery rate forecasting

V. CONCLUSION
The precariousness of the COVID-19 pandemic can ignite a massive global crisis. Some researchers and government agencies throughout the world have apprehensions that the pandemic can affect a large proportion of the world population [26], [27]. In this study, an ML-based prediction system has been proposed for predicting the risk of COVID-19 outbreak globally. The system analyses a dataset containing the day-wise actual past data and makes predictions for upcoming days using machine learning algorithms. The results of the study indicate that ES performs best in the current forecasting domain given the nature and size of the dataset. LR and LASSO also perform well, to some extent, in predicting the death rate and confirmed cases. According to the results of these two models, the death rate will increase in the upcoming days and the recovery rate will slow down. SVM produces poor results in all scenarios because of the ups and downs in the dataset values; it was very difficult to fit an accurate hyperplane between the given values of the dataset. Overall, we conclude that the model predictions are correct according to the current scenario, which may be helpful for understanding the upcoming situation. The study forecasts can thus also be of great help to the authorities in taking timely actions and making decisions to contain the COVID-19 crisis. This study will be enhanced continuously in the future; next, we plan to explore the prediction methodology using the updated dataset and to use the most accurate and appropriate ML methods for forecasting. Real-time live forecasting will be one of the primary focuses in our future work.

ACKNOWLEDGMENT
This research was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1F1A1060752) and in part by the Fareed Computing Research Center, Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Punjab, Rahim Yar Khan, Pakistan.

REFERENCES
[1] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “Statistical and machine learning forecasting methods: Concerns and ways forward,” PloS one, vol. 13, no. 3, 2018.
[2] G. Bontempi, S. B. Taieb, and Y.-A. Le Borgne, “Machine learning strategies for time series forecasting,” in European business intelligence summer school. Springer, 2012, pp. 62–77.
[3] F. E. Harrell Jr, K. L. Lee, D. B. Matchar, and T. A. Reichert, “Regression models for prognostic prediction: advantages, problems, and suggested solutions.” Cancer treatment reports, vol. 69, no. 10, pp. 1071–1077, 1985.
[4] P. Lapuerta, S. P. Azen, and L. LaBree, “Use of neural networks in predicting the risk of coronary artery disease,” Computers and Biomedical Research, vol. 28, no. 1, pp. 38–52, 1995.
[5] K. M. Anderson, P. M. Odell, P. W. Wilson, and W. B. Kannel, “Cardiovascular disease risk profiles,” American heart journal, vol. 121, no. 1, pp. 293–298, 1991.
[6] H. Asri, H. Mousannif, H. Al Moatassime, and T. Noel, “Using machine learning algorithms for breast cancer risk prediction and diagnosis,” Procedia Computer Science, vol. 83, pp. 1064–1069, 2016.
[7] F. Petropoulos and S. Makridakis, “Forecasting the novel coronavirus covid-19,” Plos one, vol. 15, no. 3, p. e0231236, 2020.
[8] G. Grasselli, A. Pesenti, and M. Cecconi, “Critical care utilization for the covid-19 outbreak in lombardy, italy: early experience and forecast during an emergency response,” Jama, 2020.
[9] WHO. Naming the coronavirus disease (covid-19) and the virus that causes it. [Online]. Available: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it
[10] C. P. E. R. E. Novel et al., “The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (covid-19) in china,” Zhonghua liu xing bing xue za zhi= Zhonghua liuxingbingxue zazhi, vol. 41, no. 2, p. 145, 2020.
[11] L. van der Hoek, K. Pyrc, M. F. Jebbink, W. Vermeulen-Oost, R. J. Berkhout, K. C. Wolthers, P. M. Wertheim-van Dillen, J. Kaandorp, J. Spaargaren, and B. Berkhout, “Identification of a new human coronavirus,” Nature medicine, vol. 10, no. 4, pp. 368–373, 2004.
[12] J. H. U. data repository. Cssegisanddata. [Online]. Available: https://github.com/CSSEGISandData
[13] M. R. M. Talabis, R. McPherson, I. Miyamoto, J. L. Martin, and D. Kaye, “Chapter 1 - analytics defined,” in Information Security Analytics, M. R. M. Talabis, R. McPherson, I. Miyamoto, J. L. Martin, and D. Kaye, Eds. Boston: Syngress, 2015, pp. 1–12. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780128002070000010
[14] H.-L. Hwa, W.-H. Kuo, L.-Y. Chang, M.-Y. Wang, T.-H. Tung, K.-J. Chang, and F.-J. Hsieh, “Prediction of breast cancer and lymph node metastatic status with tumour markers using logistic regression models,” Journal of evaluation in clinical practice, vol. 14, no. 2, pp. 275–280, 2008.
[15] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[16] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[17] X. F. Du, S. C. Leung, J. L. Zhang, and K. K. Lai, “Demand forecasting of perishable farm products using support vector machine,” International journal of systems Science, vol. 44, no. 3, pp. 556–567, 2013.
[18] F. Rustam, I. Ashraf, A. Mehmood, S. Ullah, and G. S. Choi, “Tweets classification on the base of sentiments for us airline companies,” Entropy, vol. 21, no. 11, p. 1078, 2019.


[19] E. Cadenas, O. A. Jaramillo, and W. Rivera, “Analysis and forecasting of wind velocity in chetumal, quintana roo, using the single exponential smoothing method,” Renewable Energy, vol. 35, no. 5, pp. 925–930, 2010.
[20] J. Lupón, H. K. Gaggin, M. de Antonio, M. Domingo, A. Galán, E. Zamora, J. Vila, J. Peñafiel, A. Urrutia, E. Ferrer et al., “Biomarker-assist score for reverse remodeling prediction in heart failure: the st2-r2 score,” International journal of cardiology, vol. 184, pp. 337–343, 2015.
[21] J.-H. Han and S.-Y. Chi, “Consideration of manufacturing data to apply machine learning methods for predictive manufacturing,” in 2016 Eighth International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2016, pp. 109–113.
[22] C. J. Willmott and K. Matsuura, “Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance,” Climate research, vol. 30, no. 1, pp. 79–82, 2005.
[23] R. Kaundal, A. S. Kapoor, and G. P. Raghava, “Machine learning techniques in disease forecasting: a case study on rice blast prediction,” BMC bioinformatics, vol. 7, no. 1, p. 485, 2006.
[24] S. Baran and D. Nemoda, “Censored and shifted gamma distribution based emos model for probabilistic quantitative precipitation forecasting,” Environmetrics, vol. 27, no. 5, pp. 280–292, 2016.
[25] Y. Grushka-Cockayne and V. R. R. Jose, “Combining prediction intervals in the m4 competition,” International Journal of Forecasting, vol. 36, no. 1, pp. 178–185, 2020.
[26] N. C. Mediaite. Harvard professor sounds alarm on ‘likely’ coronavirus pandemic: 40% to 70% of world could be infected this year. Accessed on 2020.02.18. [Online]. Available: https://www.mediaite.com/news/harvard-professor-sounds-alarm-on-likely-coronavirus-pandemic-40-to-70-of-world-could-be-infected-this-year/
[27] BBC. Coronavirus: Up to 70% of germany could become infected - merkel. Accessed on 2020.03.15. [Online]. Available: https://www.bbc.com/news/world-us-canada-51835856

FURQAN RUSTAM received his MCS degree from the Department of Computer Science, Islamia University of Bahawalpur, Pakistan (Oct-2015 to Oct-2017). Since Nov-2018, he has been enrolled in the Master of Computer Science program, Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, 64200, Pakistan. He is also serving as a Research Assistant at the Fareed Computing & Research Center, KFUEIT, Pakistan. His recent research interests are related to Data Mining, Machine Learning, and Artificial Intelligence, mainly working on Creative Computing and Supervised Machine Learning.

AIJAZ AHMAD RESHI received his Ph.D. degree from the Department of Computer Science, BIHER, Bharath University, Chennai, India in 2015. He is working as an Assistant Professor in the College of Computer Science and Engineering, Department of Computer Science, Taibah University, Al Madina Al Munawarah, Saudi Arabia. His recent research interests include Machine Learning, Deep Learning, Internet of Things (IoT), Web of Things (WoT) and Wireless Sensor and Actuator Networks.

ARIF MEHMOOD received his Ph.D. degree from the Department of Information & Communication Engineering, Yeungnam University, Korea (Feb-2014 to Nov-2017). He is working as an Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, Pakistan. His recent research interests are related to Data Mining, mainly working on AI and Deep Learning-based Text Mining, and Data Science Management Technologies.

SALEEM ULLAH was born in AhmedPur East, Pakistan in 1983. He received his B.Sc. and MIT degrees in Computer Science from Islamia University Bahawalpur and Bahauddin Zakariya University (Multan) in 2003 and 2005, respectively. From 2006 to 2009, he worked as a Network/IT Administrator in different companies. He completed his Doctorate degree at Chongqing University, China in 2012. From August 2012 to Feb 2016, he worked as an Assistant Professor at Islamia University Bahawalpur, Pakistan. Since February 2016, he has been working as an Associate Dean at Khwaja Fareed University of Engineering & Information Technology, Rahim Yar Khan. He has almost 14 years of industry experience in the field of IT. He is an active researcher in the fields of Adhoc Networks, IoTs, Congestion Control, Data Science, and Network Security.

BYUNG-WON ON received his Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University at University Park, PA, USA in 2007. Then, he worked as a full-time researcher at the University of British Columbia, Advanced Digital Sciences Center, and Advanced Institutes of Convergence Technology for seven years. Since 2014, he has been a faculty member in the Department of Software Convergence Engineering, Kunsan National University, Korea. His recent research interests are related to Data Mining (esp. Probability Theory and Applications), Machine Learning, and Artificial Intelligence, mainly working on Abstractive Summarization, Creative Computing, and Multi-Agent Reinforcement Learning.

WAQAR ASLAM received the M.Sc. degree in computer science from Quaid-i-Azam University, Islamabad, Pakistan, and the Ph.D. degree in computer science from the Eindhoven University of Technology, The Netherlands. For his Ph.D. degree, he received the Overseas Scholarship, HEC, Pakistan. He is currently an Assistant Professor with the Department of Computer Science & IT, The Islamia University of Bahawalpur, Pakistan. His research interests include performance modeling & QoS of wireless/computer networks, performance modeling of (distributed) software architectures, radio resource allocation, the Internet of Things, Fog Computing, effort/time/cost estimation of software development in (distributed) Agile setups, social network data analysis, and DNA/Chaos-based information security.


GYU SANG CHOI received his Ph.D. in the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA in 2005. He was a research staff member at Samsung Advanced Institute of Technology (SAIT) for Samsung Electronics from 2006 to 2009. Since 2009, he has been a faculty member in the Department of Information & Communication, Yeungnam University, Korea. His research areas include non-volatile memory and storage systems.


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
