
UNIT 2 Notes


UNIT-II

Predictors are usually not just related to the response but also related among themselves, a characteristic of the data called multicollinearity.

Using multiple features:


import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import matplotlib as mpl

from sklearn.datasets import load_boston

from sklearn import linear_model

boston = load_boston()

dataset = pd.DataFrame(boston.data, columns=boston.feature_names)

dataset['target'] = boston.target

observations = len(dataset)

variables = dataset.columns[:-1]

X = dataset.iloc[:, :-1]   # .ix has been removed from pandas; use .iloc instead

y = dataset['target'].values
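NOTE: The load_boston loader has been removed from recent scikit-learn releases (1.2 and later). If it is unavailable in your environment, one possible workaround, a sketch based on the alternative suggested in scikit-learn's own deprecation notice, is to rebuild the same dataset from the original StatLib source; from that point the preceding dataset, X, and y assignments can be reused unchanged:

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# each observation spans two rows in the raw file; recombine them into 13 features plus the target
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
dataset = pd.DataFrame(data, columns=feature_names)
dataset['target'] = target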

Model building with Statsmodels:


Let's load the necessary modules from the package (one working with matrices and the other with formulas):

import statsmodels.api as sm

import statsmodels.formula.api as smf

Let's also prepare a suitable input matrix, naming it Xc, obtained by adding an extra column containing the bias vector (a constant variable with unit value):

Xc = sm.add_constant(X)

linear_regression = sm.OLS(y,Xc)

fitted_model = linear_regression.fit()
After fitting the model specified above, let's immediately ask for a summary:

fitted_model.summary()
First, the adjusted R-squared is something to take note of now. When working with multiple variables, the standard R-squared can get inflated by the many coefficients inserted into the model. If you are using too many predictors, the adjusted measure will diverge perceptibly from the plain R-squared, because it takes the complexity of the model into account and reports a much more realistic measure.

NOTE: A quick check is the ratio between the plain and the adjusted R-squared: if the plain measure exceeds the adjusted one by more than 20%, some redundant variables have probably been introduced into the model specification. Naturally, the larger the gap, the more serious the problem.

This is not the case in our example, because the difference is quite slight: the two measures are approximately 0.741 and 0.734, which as a ratio turns out to be 0.741/0.734 = 1.01, that is, the plain R-squared is just about 1% above the adjusted one.
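As a quick check, both measures can be read directly from the fitted results and compared (a minimal sketch; rsquared and rsquared_adj are the attribute names exposed by Statsmodels' OLS results):

r2 = fitted_model.rsquared          # plain R-squared
r2_adj = fitted_model.rsquared_adj  # adjusted R-squared
print (r2, r2_adj, r2 / r2_adj)     # a ratio well above 1.20 points to redundant predictors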
Then, when working with so many variables at a time, the coefficients should also be checked for warning signs. The risk is that some coefficients pick up noisy, non-valuable information; usually such coefficients are not far from zero and are noticeable because of their large standard errors. Statistical t-tests are the right tool to spot them.

NOTE: Be aware that variables with a high p-value are good candidates for removal from the model, because there is probably little evidence that their estimated coefficient is different from zero.

In our example, being largely not significant (p-value greater than 0.05), the AGE and INDUS variables are represented in the model by coefficients whose usefulness could be seriously challenged.
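To list such candidates programmatically, you can filter the p-values stored in the fitted results (a small sketch using the conventional 0.05 threshold):

pvalues = fitted_model.pvalues
print (pvalues[pvalues > 0.05])   # here this is expected to include AGE and INDUS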

Finally, the condition number test (Cond. No.) is another previously mentioned statistic that acquires fresh importance in the light of a system of predictors. It signals numerically unstable results when attempting an optimization based on matrix inversion. The cause of such instability is multicollinearity, a problem we are going to expand on in the following paragraphs.

NOTE: When the condition number is over 30, there is a clear signal that numerical instability is making the results less reliable. Predictions may be affected by errors, and the coefficients may change drastically when rerunning the same regression analysis on a subset or a different set of observations.
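The condition number can also be checked outside the summary table; the sketch below recomputes an analogous figure with NumPy on the design matrix. Keep in mind that its absolute value depends on the scale of the predictors, so compare it with the 30 threshold only once the variables are on comparable scales:

print (np.linalg.cond(np.asarray(Xc, dtype=float)))   # ratio of largest to smallest singular value of Xc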

Using formulae as an alternative:


To obtain the same results using statsmodels.formula.api, you specify a formula to be interpreted by the Patsy package:

linear_regression = smf.ols(formula='target ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT', data=dataset)

fitted_model = linear_regression.fit()

In this case, you have to explicitly name all the variables entering the model on the right-hand side of the formula. After fitting the model, you can use all the previously seen Statsmodels methods for reporting the coefficients and results.
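If you prefer not to type the whole formula by hand, it can be assembled from the column names collected earlier in the variables object (a small sketch producing the same model):

formula = 'target ~ ' + ' + '.join(variables)
linear_regression = smf.ols(formula=formula, data=dataset)
fitted_model = linear_regression.fit()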

The correlation matrix:


Having more variables in the analysis now, we are still quite interested in how each predictor relates
to the response; however, we have to distinguish whether the relation between the variance of the
predictor and that of the target is due to unique or shared variance.

The measurement of the association due to unique variance is called partial correlation, and it expresses what can be guessed about the response thanks to the information uniquely present in a variable. It represents the exclusive contribution of a variable in predicting the response, its unique impact as a direct cause of the target.

The shared variance is instead the amount of information that is simultaneously present in a variable and in other variables of the dataset at hand. Shared variance can have many causes: maybe one variable causes the other, or it merely interferes with it. Shared variance between two variables is called collinearity; when more than two variables are involved, it is called multicollinearity, and it makes it difficult to attribute the predictive information to any single predictor.
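Partial correlation is not computed explicitly in these notes, but the idea can be sketched as follows: remove the effect of all the other predictors from both the chosen variable and the target, then correlate the residuals. The helper name partial_correlation and the choice of RM below are purely illustrative:

def partial_correlation(X, y, feature):
    # residualize both the chosen feature and the target on the remaining predictors
    others = X.drop(columns=[feature])
    res_x = X[feature] - linear_model.LinearRegression().fit(others, X[feature]).predict(others)
    res_y = y - linear_model.LinearRegression().fit(others, y).predict(others)
    return np.corrcoef(res_x, res_y)[0, 1]

print (partial_correlation(X, y, 'RM'))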

From the statistical point of view, high or near-perfect multicollinearity not only often renders coefficient estimation impossible (the matrix inversion fails), but also, when estimation is feasible, it makes the coefficient estimates imprecise, leading to large standard errors of the coefficients.

Having multicollinear variables, in fact, makes it difficult to select the correct variables for the analysis (since the variance is shared, it is hard to figure out which variable is its causal source), leading to sub-optimal solutions that can be resolved only by increasing the number of observations involved in the analysis.
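A related diagnostic, not used in these notes but widely adopted, is the variance inflation factor (VIF), which quantifies how much the variance of each estimated coefficient is inflated by the other predictors; Statsmodels provides it directly (a sketch; values above 5-10 are conventionally taken as a warning):

from statsmodels.stats.outliers_influence import variance_inflation_factor

Xc_values = np.asarray(Xc, dtype=float)
vif = [variance_inflation_factor(Xc_values, i) for i in range(Xc_values.shape[1])]
print (pd.Series(vif, index=Xc.columns))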

To determine to what extent, and between how many predictors, such mutual relations exist, the right tool is a correlation matrix, which, though a bit difficult to read when the number of features is high, is still the most direct way to ascertain the presence of shared variance:

X = dataset.iloc[:, :-1]

correlation_matrix = X.corr()

print (correlation_matrix)
At first glance, some high correlations, in the order of 0.70 in absolute value, appear to be present between TAX, NOX, INDUS, and DIS. That is fairly explainable, since DIS is the distance from employment centers, NOX is a pollution indicator, INDUS is the quota of non-residential or commercial buildings in the area, and TAX is the property tax rate.
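Rather than scanning the matrix by eye, the strongly correlated pairs can be listed programmatically (a small sketch using the 0.70 threshold mentioned above):

strong = correlation_matrix.abs() > 0.70
for a in variables:
    for b in variables:
        if a < b and strong.loc[a, b]:
            print (a, b, round(correlation_matrix.loc[a, b], 2))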

A faster, but less numerical representation is to build a heat map of the correlations:

def visualize_correlation_matrix(data, hurdle=0.0):
    R = np.corrcoef(data, rowvar=0)
    R[np.where(np.abs(R) < hurdle)] = 0.0
    heatmap = plt.pcolor(R, cmap=mpl.cm.coolwarm, alpha=0.8)
    heatmap.axes.set_frame_on(False)
    heatmap.axes.set_yticks(np.arange(R.shape[0]) + 0.5, minor=False)
    heatmap.axes.set_xticks(np.arange(R.shape[1]) + 0.5, minor=False)
    heatmap.axes.set_xticklabels(variables, minor=False)
    plt.xticks(rotation=90)
    heatmap.axes.set_yticklabels(variables, minor=False)
    # recent matplotlib versions require booleans (not the strings 'on'/'off') here
    plt.tick_params(axis='both', which='both', bottom=False,
                    top=False, left=False, right=False)
    plt.colorbar()
    plt.show()

visualize_correlation_matrix(X, hurdle=0.5)

This displays a heat map of the correlations between predictors.

Having set the cut at 0.5 correlation (which translates into 25% shared variance), the heat map immediately reveals how PTRATIO and B are not strongly related to the other predictors. As a reminder of the meaning of the variables, B is an indicator derived from the proportion of Black residents in the area and PTRATIO is the pupil-teacher ratio in the schools of the area. Another intuition provided by the map is that the cluster of variables TAX, INDUS, NOX, and RAD is confirmed to be in strong linear association.

An even more automatic way to detect such associations (and figure out numerical problems in a
matrix inversion) is to use eigenvectors. Explained in layman's terms, eigenvectors are a very smart
way to recombine the variance among the variables, creating new features accumulating all the
shared variance. Such recombination can be achieved using the NumPy linalg.eig function, resulting
in a vector of eigenvalues (representing the amount of recombined variance for each new variable)
and eigenvectors (a matrix telling us how the new variables relate to the old ones):

corr = np.corrcoef(X, rowvar=0)
eigenvalues, eigenvectors = np.linalg.eig(corr)


After extracting the eigenvalues, we print them and look for any element whose value is near zero or small compared to the others. Near-zero values can represent a real problem for normal equations and other optimization methods based on matrix inversion, while small values represent a high but not critical source of multicollinearity. If you spot any of these low values, keep a note of their index in the list:

print (eigenvalues)

Out: [ 6.12265476  1.43206335  1.24116299  0.85779892  0.83456618  0.65965056
       0.53901749  0.39654415  0.06351553  0.27743495  0.16916744  0.18616388
       0.22025981]
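Because the raw output of np.linalg.eig is not sorted, a small sketch with argsort helps to rank the eigenvalues and note the indices of the smallest ones:

order = np.argsort(eigenvalues)           # indices from smallest to largest eigenvalue
print (order[:3], eigenvalues[order[:3]]) # index 8 holds the smallest value here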

Using their index position in the list of eigenvalues, you can recall their specific vector from
eigenvectors, which contains all the variable loadings—that is, the level of association with the
original variables. In our example, we investigate the eigenvector at index 8. Inside the eigenvector,
we notice values at index positions 2, 8, and 9, which are indeed outstanding in terms of absolute
value:

print (eigenvectors[:,8])

Out: [-0.04552843  0.08089873  0.25126664 -0.03590431 -0.04389033
      -0.04580522  0.03870705  0.01828389  0.63337285 -0.72024335
      -0.02350903  0.00485021 -0.02477196]

We now print the variable names to see which ones contribute most, in absolute value, to this eigenvector:

print (variables[2], variables[8], variables[9])

Out: INDUS RAD TAX

Removal of some of the multicollinearity culprits is usually the best solution.
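As a final sketch (the exact variables to drop depend on your own analysis; INDUS and TAX are used here only as examples suggested by the eigenvector above), removing the culprits and refitting is straightforward:

X_reduced = X.drop(columns=['INDUS', 'TAX'])
Xc_reduced = sm.add_constant(X_reduced)
reduced_model = sm.OLS(y, Xc_reduced).fit()
print (reduced_model.summary())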
