R Assignment - PDF Vegulla
by
Vegulla Manikantha
FMS/MBA/21-23/041
Submitted to:
Prof. Dr. Bhagirathi Nayak, Professor, FMS, Sri Sri University
CUTTACK-754006
Date:
1. The data set is imported into RStudio via the following steps: File > Import
Dataset > From Excel > Browse (TJ) > Import.
2. The column titled “SEC Fillings” is removed using the subset() function. This
function creates a subset of the original dataset consisting of the variables
of our choice. As shown in the screenshot, we mention our dataset (TJ), then
select the variables to be eliminated [select= -c(SEC Fillings)]. This subset is
then assigned to a variable named “input”.
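The column-removal step can be sketched as follows. This is a minimal illustration, not the original code: the small data frame standing in for TJ, its values and the other column names are assumptions. Because the column name contains a space, the sketch wraps it in backticks inside select.

```r
# Hypothetical stand-in for the imported dataset TJ (values are made up)
TJ <- data.frame(
  Price = c(10.5, 20.1, 15.3),
  `Dividend Yield` = c(1.2, 0.8, 2.1),
  `SEC Fillings` = c("reports", "reports", "reports"),
  check.names = FALSE  # keep the spaces in the column names
)

# Drop the "SEC Fillings" column; the minus sign excludes it
input <- subset(TJ, select = -c(`SEC Fillings`))

print(names(input))
```

The original data frame TJ is left unchanged; only “input” lacks the removed column.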
3. The correlation matrix of the entire dataset is constructed along with the
respective significance values (p-values). The rcorr() function comes from
the ‘Hmisc’ package and returns both the correlation matrix and the p-value
matrix of the dataset in a single call. rcorr() only works with matrices,
so we pass our dataset (input) as a matrix using the ‘as.matrix’ function.
The output is assigned to a variable named “result”.
The correlation coefficient allows us to gauge two aspects:
i) The strength of the relationship (values close to -1 or +1 = strong
correlation; values close to 0 = weak correlation; 0 = no linear correlation)
ii) The direction of correlation (positive correlation = as one variable
increases, the other variable tends to increase, and vice versa;
negative correlation = as one variable increases, the other variable
tends to decrease, and vice versa)
The p-value is examined in order to know whether there exists any significant
relation between the two variables with respect to the population, because a
correlation coefficient different from 0 in the sample does not mean that
the correlation is significantly different from 0 in the population.
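The correlation step described above can be sketched as follows, assuming the ‘Hmisc’ package is installed. The simulated data frame standing in for “input” and its column names are assumptions made for illustration only.

```r
library(Hmisc)  # provides rcorr()

# Hypothetical numeric data standing in for 'input' (values are simulated)
set.seed(1)
price <- rnorm(50, mean = 100, sd = 15)
input <- data.frame(
  Price = price,
  Earnings = 0.05 * price + rnorm(50),  # built to correlate with Price
  Noise = rnorm(50)                     # unrelated to Price
)

# rcorr() needs a matrix; it returns a list with r (coefficients),
# n (observations used) and P (p-values)
result <- rcorr(as.matrix(input))

print(round(result$r, 2))
print(round(result$P, 4))
```

Reading the two matrices side by side shows which coefficients in result$r are backed by a small p-value in result$P.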
4. For the purpose of this analysis, the objective is to examine the correlation
between the price of the stock and each of the following variables:
price/earnings, dividend yield and market cap.
The meaning of the terms is: ‘result1$r’ denotes that the plot is constructed
from the correlation coefficients of the variables. ‘type’ denotes the layout
of the graph. ‘order’ denotes that the variables will be arranged on the basis of
their respective correlation coefficients. ‘p.mat’ supplies the p-value matrix
to the plot as well. ‘sig.level’ sets the level of significance (in this case,
0.01). The ‘insig’ argument specifies the action to be taken for correlations
whose p-value is greater than the significance level (in this case, we enter
“blank”, indicating that correlations with a p-value greater than 0.01 will be
left blank).
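A possible form of the plotting call is sketched below, assuming the ‘corrplot’ and ‘Hmisc’ packages are installed. The simulated data and the values chosen for ‘type’ and ‘order’ are assumptions, since the original values are not stated in the text; only sig.level = 0.01 and insig = "blank" come from the description above.

```r
library(Hmisc)     # provides rcorr()
library(corrplot)  # provides corrplot()

# Hypothetical data for the four variables of interest (values are simulated)
set.seed(7)
price <- rnorm(60, mean = 100, sd = 20)
input1 <- data.frame(
  Price = price,
  `Price/Earnings` = 0.2 * price + rnorm(60, sd = 2),
  `Dividend Yield` = 5 - 0.02 * price + rnorm(60, sd = 0.3),
  `Market Cap` = rnorm(60, mean = 50, sd = 10),
  check.names = FALSE
)

result1 <- rcorr(as.matrix(input1))

# rcorr() leaves the diagonal of P as NA; set it to 0 so that the
# diagonal cells are never treated as insignificant
p_values <- result1$P
diag(p_values) <- 0

corrplot(result1$r,
         type = "upper",     # assumed layout: upper triangle only
         order = "hclust",   # assumed ordering: hierarchical clustering
         p.mat = p_values,
         sig.level = 0.01,
         insig = "blank")    # leave cells with p > 0.01 empty
```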
TIME-SERIES FORECASTING:
A time series data set consists of a set of observations about a single
phenomenon that are recorded at a constant time interval (daily, weekly,
monthly, quarterly, annually etc.). For this analysis, a dataset consisting
of the GDP of a country in each quarter of every year (from the 1st quarter
of 1959 to the 1st quarter of 2001) is imported into RStudio, consisting of
169 observations and 2 variables in total. The class of the imported data is
a tabular (data frame) format.
This dataset has to be converted into a time series, as depicted in the
following screenshot. The syntax shows that we convert the GDP variable to a
time series, with the start and end dates taken as the minimum and maximum of
the corresponding GDP dates respectively, and with observations recorded on
a quarterly basis (frequency = 4). The time series is assigned to a variable
named “gdptime”. A plot of the time series is constructed for visualisation.
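The conversion can be sketched with base R as follows. The data frame name “gdp”, its column names and the simulated GDP values are assumptions; the sketch also fixes the start at 1959 Q1 directly rather than deriving it from the date column as the original does.

```r
# Hypothetical stand-in for the imported data frame: 169 quarterly
# observations from 1959 Q1 to 2001 Q1 (GDP values are simulated)
set.seed(42)
gdp <- data.frame(
  DATE = seq(as.Date("1959-01-01"), by = "quarter", length.out = 169),
  GDP = cumsum(runif(169, min = 5, max = 60)) + 500
)

# Convert the GDP column to a quarterly time series starting in 1959 Q1
gdptime <- ts(gdp$GDP, start = c(1959, 1), frequency = 4)

print(class(gdptime))   # "ts" instead of "data.frame"
print(start(gdptime))
print(end(gdptime))

plot(gdptime, main = "Quarterly GDP", ylab = "GDP")
```

With 169 quarterly observations starting in 1959 Q1, the series ends in 2001 Q1, which matches the period stated above.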
The model used for this analysis is the ARIMA (“Autoregressive Integrated
Moving Average”) model, and this model requires that the time-series data
satisfy the following conditions:
i) The time-series data must be stationary [(i.e.) the mean, variance and
autocovariance of the variable (GDP) should stay the same over time]
ii) The time-series data must be free of leftover autocorrelation [(i.e.) once
the model is fitted, the residuals of GDP must not correlate with their own
lag values]
In order to check these conditions, we conduct the following tests, namely:
• Augmented Dickey-Fuller (adf) test: If the p-value is less than 0.05, then
the dataset is treated as stationary.
• Autocorrelation function (acf): The spikes of the acf plot should lie within
the dotted blue lines, which mark the significance bounds around zero, for the
dataset to be devoid of autocorrelation and nonstationarity.
From (Screenshot- 3), we see that the p-value of the (adf) test is 0.99, thereby
indicating non-stationarity, and the (acf plot) also shows signs of
autocorrelation.
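The two checks can be sketched as follows, assuming the ‘tseries’ package is installed for adf.test(). The simulated trending series stands in for “gdptime”, so its p-value will differ from the 0.99 reported above.

```r
library(tseries)  # provides adf.test()

# Hypothetical trending quarterly series standing in for gdptime
set.seed(42)
gdptime <- ts(cumsum(runif(169, min = 5, max = 60)) + 500,
              start = c(1959, 1), frequency = 4)

# Augmented Dickey-Fuller test: a p-value below 0.05 suggests stationarity
adf_result <- adf.test(gdptime)
print(adf_result$p.value)

# Autocorrelation function; plot = FALSE returns the values without drawing
acf_result <- acf(gdptime, plot = FALSE)

# Approximate 95% bounds, the dotted blue lines in the default acf plot
bound <- qnorm(0.975) / sqrt(length(gdptime))

# Number of lags (excluding lag 0) whose autocorrelation exceeds the bounds
print(sum(abs(acf_result$acf[-1]) > bound))
```

Calling acf(gdptime) without plot = FALSE draws the plot with the bounds shown as dotted lines.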
(d) stands for the order of differencing (i.e.) the number of times the
time-series data has to be differenced in order to eliminate nonstationarity.
Differencing involves subtracting from each value the immediate lag value
preceding that value.
(q) stands for the moving average order (i.e.) representing the error of the
model as a combination of the lag error terms. This essentially means that the
moving average plays the same role as auto-regression but takes the lagged
residuals impacting the variable into account instead of the lag values of the
same variable.
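Differencing can be illustrated with base R's diff() on a short made-up vector (the numbers are chosen only for illustration and are not GDP values).

```r
# Hypothetical series with a steadily growing trend
x <- c(100, 104, 110, 118, 128)

# First difference (d = 1): each value minus the value before it
d1 <- diff(x)
print(d1)   # 4 6 8 10

# Second difference (d = 2): the first difference, differenced again
d2 <- diff(x, differences = 2)
print(d2)   # 2 2 2
```

Here the first difference still trends upward, while the second difference is constant, so d = 2 would be needed to remove the trend in this toy series.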
In order to check the stationarity of the model, we once again perform the
(adf) and (acf) tests on the model, this time on its residuals (due to the
presence of the moving average). The screenshot depicts that the p-value is 0.01
and the acf graph shows that the peaks have settled within the significance
bounds. This shows that stationarity is satisfied and autocorrelation is
eliminated.
Thus, auto.arima has performed these steps in one call, thereby saving
ample time. The output of the auto.arima function is assigned to the variable
named “gdpmodel”. Now that the time series is stationary, it is fit for
forecasting. The forecasting is done using the forecast() function, wherein
we mention the model to be used for the forecast (gdpmodel), followed by
the confidence level (95% in this case) and the forecast period (here we are
forecasting for 10 years with 4 quarters in each year, i.e. 40 periods). The
screenshot depicts the model and the visualisation as well.
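The model-fitting and forecasting step can be sketched as follows, assuming the ‘forecast’ package is installed. The simulated series stands in for “gdptime”, so the selected model order and the forecast values will differ from those in the screenshot; auto.arima() is called with its default settings, which is an assumption.

```r
library(forecast)  # provides auto.arima() and forecast()

# Hypothetical trending quarterly series standing in for gdptime
set.seed(42)
gdptime <- ts(cumsum(runif(169, min = 5, max = 60)) + 500,
              start = c(1959, 1), frequency = 4)

# Let auto.arima() choose the (p, d, q) orders automatically
gdpmodel <- auto.arima(gdptime)
print(gdpmodel)

# Forecast 10 years x 4 quarters = 40 periods with a 95% interval
gdpforecast <- forecast(gdpmodel, level = c(95), h = 10 * 4)

plot(gdpforecast)
```

The returned object holds the point forecasts in gdpforecast$mean and the interval limits in gdpforecast$lower and gdpforecast$upper.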
Finally, we check whether the model is adequate for forecasting. For this, we
use the Ljung-Box test, which examines the autocorrelations of the residuals
and indicates whether the model is fit for forecasting. In this case,
the test is performed for 25 lag values. The p-value is 0.429, which is greater
than 0.05, thereby indicating the absence of significant residual
autocorrelation and supporting the validity of the model.
(Screenshot) gives the depiction of the test and its result.
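The residual check can be sketched with base R's Box.test(). To keep the sketch free of extra packages, it fits a fixed ARIMA(0,1,1) with base R's arima() as a stand-in for “gdpmodel”; that order and the simulated series are assumptions, so the p-value will not match the 0.429 reported above.

```r
# Hypothetical trending quarterly series standing in for gdptime
set.seed(42)
gdptime <- ts(cumsum(runif(169, min = 5, max = 60)) + 500,
              start = c(1959, 1), frequency = 4)

# Stand-in model: the order (0, 1, 1) is assumed for illustration only
gdpmodel <- arima(gdptime, order = c(0, 1, 1))

# Ljung-Box test on the residuals over 25 lags;
# a p-value above 0.05 suggests no significant residual autocorrelation
lb <- Box.test(residuals(gdpmodel), lag = 25, type = "Ljung-Box")
print(lb)
```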
Thus, the timeseries forecasting is performed.