Unit 6
Unit-6: Linear Models: Simple Linear Regression, Multiple Regression; Generalized Linear Models: Logistic Regression, Poisson Regression, other Generalized Linear Models; Survival Analysis; Nonlinear Models: Splines, Decision Trees, Random Forests.
Regression:- Regression analysis is a widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose values are gathered through experiments. The other variable is called the response variable, whose values are derived from the predictor variable.
Linear Regression:- In Linear Regression these two variables are related through an equation, where
exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight
line when plotted as a graph. A non-linear relationship where the exponent of any variable is not equal
to 1 creates a curve.
The general mathematical equation for a linear regression is − y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
lm() Function:- This function creates the relationship model between the predictor and the response variable.
Syntax: lm(formula, data)
Following is the description of the parameters used −
formula is a symbol presenting the relation between x and y.
data is the data frame on which the formula will be applied (it may be omitted when the variables already exist in the workspace, as in the example below).
Example:-
> height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(weight~height)
> print(relation)
Call:
lm(formula = weight ~ height)
Coefficients:
(Intercept) height
-38.4551 0.6746
> plot(height, weight, col = "blue", main = "Height & Weight Regression")
> abline(relation, col = "orange")
The correlation coefficient r measures the strength of the linear relationship:
r = [ (1/n) Σxy − x̄·ȳ ] / [ √((1/n) Σ(x − x̄)²) · √((1/n) Σ(y − ȳ)²) ]
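As a sketch, this formula can be checked against R's built-in cor() using the height and weight data from the example above:

```r
# Height/weight data from the lm() example above
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
n <- length(height)

# Numerator: (1/n) * sum(xy) - mean(x) * mean(y)
num <- sum(height * weight) / n - mean(height) * mean(weight)

# Denominator: product of the two root-mean-square deviations
den <- sqrt(sum((height - mean(height))^2) / n) *
       sqrt(sum((weight - mean(weight))^2) / n)

r_manual <- num / den
r_manual     # agrees with cor(height, weight)
```

The factors of 1/n cancel between numerator and denominator, which is why cor() gives the same value even though it uses n − 1 internally.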
Multiple Regression:- Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent (response) variable.
The general mathematical equation for multiple regression with two predictors is – x1 = a0 + a1x2 + a2x3
Following is the description of the parameters used −
x1 is the response variable.
a0, a1 and a2 are the coefficients.
x2 and x3 are the predictor variables.
The normal equations for estimating a0, a1 and a2 are:
Σx1 = n a0 + a1 Σx2 + a2 Σx3
Σx1x2 = a0 Σx2 + a1 Σx2² + a2 Σx2x3
Σx1x3 = a0 Σx3 + a1 Σx2x3 + a2 Σx3²
We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given set of
predictor variables using these coefficients.
lm() Function :-This function creates the relationship model between the predictor and the response
variable.
Syntax :- lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
formula is a symbol presenting the relation between the response variable and predictor
variables.
data is the data frame on which the formula will be applied.
Example
> lm(mpg~disp+hp+wt,data=mtcars)
Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
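Once fitted, the model can be used for prediction with predict(). A minimal sketch, where the new car's values are made up for illustration:

```r
# Refit the multiple regression model from the example above
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Predict mpg for a hypothetical car (illustrative values)
newcar <- data.frame(disp = 200, hp = 120, wt = 2.8)
pred <- predict(model, newdata = newcar)
pred
```

The prediction is simply the fitted coefficients applied to the new predictor values.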
Logistic Regression: Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1. It models the probability of a binary response as a function of the predictor variables through the mathematical equation relating them.
The general mathematical equation for logistic regression is −
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
y is the response variable.
x1, x2, x3, ... are the predictor variables.
a and b1, b2, b3, ... are the coefficients, which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax :- glm(formula,data,family)
Following is the description of the parameters used −
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object to specify the details of the model. Its value is binomial for logistic regression.
For example, in the built-in data set mtcars, the data column am represents the transmission type of
the automobile model (0 = automatic, 1 = manual). With the logistic regression equation, we can model
the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.
> am.glm = glm(formula=am ~ hp + wt, data=mtcars, family=binomial)
> am.glm
Coefficients:
(Intercept) hp wt
18.86630 0.03626 -8.08348
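With the model fitted, predict() with type = "response" returns the probability directly rather than the log-odds. A sketch, where the 120 hp / 2.8 weight values are illustrative, not from the data set:

```r
# Refit the logistic model from the example above
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Probability of a manual transmission for an illustrative car
# (120 hp, weight 2.8 in 1000 lbs)
p <- predict(am.glm, newdata = data.frame(hp = 120, wt = 2.8),
             type = "response")
p
```

This is equivalent to plugging the linear predictor a + b1·hp + b2·wt into 1/(1 + e^-x), matching the equation above.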
Poisson Regression:- Poisson regression involves regression models in which the response variable is in the form of counts and not fractional numbers, for example the number of births or the number of wins in a football match series. The values of the response variable are assumed to follow a Poisson distribution.
The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
y is the response variable.
a and b1, b2, ..., bn are the numeric coefficients.
x1, x2, ..., xn are the predictor variables.
The function used to create the Poisson regression model is the glm() function.
We have the in-built data set "warpbreaks" which describes the effect of wool type (A or B) and tension
(low, medium or high) on the number of warp breaks per loom. Let's consider "breaks" as the response
variable which is a count of number of breaks. The wool "type" and "tension" are taken as predictor
variables.
> output <-glm(formula = breaks ~ wool+tension,data = warpbreaks,
+ family = poisson)
> print(summary(output))
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
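Because the model works on log(y), predicted counts are obtained by exponentiating the linear predictor; predict() with type = "response" does this automatically. A sketch:

```r
# Refit the Poisson model from the example above
output <- glm(breaks ~ wool + tension, data = warpbreaks,
              family = poisson)

# Expected number of breaks for wool A at low tension; both are the
# reference levels, so the linear predictor is just the intercept
pred <- predict(output,
                newdata = data.frame(wool = "A", tension = "L"),
                type = "response")
pred
```

For wool A at low tension the expected count is exp(3.69196), about 40 breaks per loom.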
CURVE FITTING:- Curve fitting is the process of constructing a curve, or mathematical function, that
has the best fit to a series of data points, possibly subject to constraints.
Fitting of a second degree polynomial y = a + bx + cx²: the normal equations are
Σy = na + b Σx + c Σx²
Σxy = a Σx + b Σx² + c Σx³
Σx²y = a Σx² + b Σx³ + c Σx⁴
Exponential curve y = a·b^x: apply log on both sides,
log y = log a + x log b, i.e. Y = A + Bx
where Y = log y, A = log a and B = log b.
Exponential curve y = a·e^(bx): apply log on both sides,
log y = log a + bx, i.e. Y = A + bx
The normal equations are
ΣY = nA + b Σx
ΣxY = A Σx + b Σx²
where Y = log y and A = log a.
Power curve y = a·x^b: apply log on both sides,
log y = log a + b log x, i.e. Y = A + bX
The normal equations are
ΣY = nA + b ΣX
ΣXY = A ΣX + b ΣX²
where Y = log y, X = log x and A = log a.
Example:- Fit a straight line y = a + bx to the following data.
x x² y xy
0 0 1 0
1 1 1.8 1.8
2 4 3.3 6.6
3 9 4.5 13.5
4 16 6.3 25.2
Σx = 10, Σx² = 30, Σy = 16.9, Σxy = 47.1
Substituting the values into the normal equations Σy = na + b Σx and Σxy = a Σx + b Σx², we get
5a+10b = 16.9 .......(1)
10a+30b = 47.1 .......(2)
Solving (1) and (2), we get
Multiply eq (1) with 2
10a+20b = 33.8 ........(3)
Subtract (3) from (2)
10a + 30b = 47.1
10a + 20b = 33.8
10b = 13.3
Therefore b = 1.33; now substitute in (1) to get a = 0.72.
Thus the equation of the straight line is y = a + bx
y = 0.72 + 1.33x
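The same line can be obtained with lm(), which solves the least-squares problem behind these normal equations. A quick sketch:

```r
x <- c(0, 1, 2, 3, 4)
y <- c(1, 1.8, 3.3, 4.5, 6.3)

# lm() minimizes the same sum of squared errors that the
# normal equations describe
fit <- lm(y ~ x)
coef(fit)   # intercept a and slope b
```

This provides an easy cross-check of the hand arithmetic.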
Example:- Fit a second degree polynomial to the following data.
Solution:- The polynomial equation is y = a + bx + cx²
The three normal equations are
Σy = na + b Σx + c Σx²
Σxy = a Σx + b Σx² + c Σx³
Σx²y = a Σx² + b Σx³ + c Σx⁴
x x² x³ x⁴ y xy x²y
1 1 1 1 10 10 10
2 4 8 16 12 24 48
3 9 27 81 8 24 72
4 16 64 256 10 40 160
5 25 125 625 14 70 350
Σx = 15, Σx² = 55, Σx³ = 225, Σx⁴ = 979, Σy = 54, Σxy = 168, Σx²y = 640
Substituting the values, we get
5a + 15b + 55c = 54 .......(1)
15a + 55b + 225c = 168 .......(2)
55a + 225b + 979c = 640 .......(3)
Solving these three equations gives c = 5/7 ≈ 0.714, b = -129/35 ≈ -3.686 and a = 14.
Thus the fitted polynomial is y = 14 - 3.686x + 0.714x²
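The polynomial fit can also be reproduced with lm(); wrapping the squared term in I() makes R treat x² as a second predictor, so lm() solves the same three normal equations. A sketch:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(10, 12, 8, 10, 14)

# I(x^2) adds the squared term as a separate predictor
fit <- lm(y ~ x + I(x^2))
coef(fit)   # a, b and c
```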
Example:- Fit an exponential curve y = a·b^x to the following data (logs are base 10).
x y Y = log y xY x²
0 1.05 0.021 0 0
1 2.1 0.322 0.322 1
2 3.85 0.585 1.171 4
3 8.3 0.919 2.757 9
Σx = 6, Σy = 15.3, ΣY = 1.847, ΣxY = 4.250, Σx² = 14
Substituting into the normal equations ΣY = nA + B Σx and ΣxY = A Σx + B Σx², we get
4A + 6B = 1.847
6A + 14B = 4.250
Solving gives A ≈ 0.018 and B ≈ 0.296, so a = 10^A ≈ 1.04 and b = 10^B ≈ 1.98.
Thus the fitted curve is y ≈ 1.04 · (1.98)^x
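The log-linearization can be reproduced in R by fitting lm() to log10(y). A sketch:

```r
x <- c(0, 1, 2, 3)
y <- c(1.05, 2.1, 3.85, 8.3)

# Taking log10 turns y = a * b^x into Y = A + B*x
fit <- lm(log10(y) ~ x)
A <- coef(fit)[[1]]   # log10(a)
B <- coef(fit)[[2]]   # log10(b)
a <- 10^A
b <- 10^B
c(a = a, b = b)       # roughly a = 1.04, b = 1.98
```

Back-transforming the intercept and slope with 10^ recovers the original parameters a and b.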
Survival analysis: Survival analysis is generally defined as a set of methods for analyzing data where
the outcome variable is the time until the occurrence of an event of interest. The event can be death,
occurrence of a disease, marriage, divorce, etc.
In survival analysis, there is a special structure for right-censored survival data. To use it, one first must load the “survival” package, which is included in the main R distribution:
library(survival)
The basic syntax for survival analysis in R is −
Surv(time,event)
survfit(formula)
Following is the description of the parameters used −
time is the follow up time until the event occurs.
event indicates the status of occurrence of the expected event.
formula is the relationship between the predictor variables.
Next, define the survival times “tt” and the censoring indicator “status”, where “status = 1” indicates
that the time is an observed event, and “status = 0” indicates that it is censored. Then the “Surv”
function binds them into a single object. In the following example, time 6 is right censored, while the
others are observed event times,
> tt <- c(2, 5, 6, 7, 8)
> status <- c( 1, 1, 0, 1, 1)
> Surv(tt, status) # Create a survival data structure
[1] 2 5 6+ 7 8
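From the Surv object, a Kaplan-Meier estimate of the survival curve can be obtained with survfit(); the ~ 1 on the right-hand side means a single group with no predictors. A sketch using the same small data set:

```r
library(survival)

tt <- c(2, 5, 6, 7, 8)
status <- c(1, 1, 0, 1, 1)   # 0 = censored, 1 = observed event

# Kaplan-Meier estimate for one group (no predictors)
km <- survfit(Surv(tt, status) ~ 1)
summary(km)   # survival probability after each event time
```

The censored time 6 reduces the number at risk but contributes no drop in the estimated curve.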
Nonlinear Models
Decision trees:- A decision tree is a graph that represents choices and their results in the form of a tree. The nodes in the graph represent an event or choice and the edges of the graph represent the decision rules or conditions. It is widely used in Machine Learning and Data Mining applications using R.
Examples:
• Predicting whether an email is spam or not spam
• Predicting whether a tumor is cancerous
• Predicting whether a loan is a good or bad credit risk based on the factors in each of these
The package "party" has the function ctree() which is used to create and analyze decision trees.
Syntax : ctree(formula, data)
Following is the description of the parameters used −
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Example:
library(party)
# Species and the four measurement columns come from the built-in iris data set
model2 <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
plot(model2)
Random Forests:- In the random forest approach, a large number of decision trees are created. Every
observation is fed into every decision tree. The most common outcome for each observation is used as
the final output.
The package "randomForest" has the function randomForest() which is used to create and analyze
random forests.
Syntax :- randomForest(formula, data)
Following is the description of the parameters used −
formula is a formula describing the predictor and response variables.
data is the name of the data set used.
Advantages
• It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of which variables are important in the classification.
• It generates an internal unbiased estimate of the generalization error as the forest building progresses.
• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
Disadvantages
• Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
• For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from random forests are not reliable for this type of data.
Splines: A linear spline is a continuous function formed by connecting linear segments. The points
where the segments connect are called the knots of the spline.
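A linear spline can be fit in R with lm() and bs() from the splines package (shipped with R). The data and the knot location below are made up for illustration:

```r
library(splines)

# Made-up data: slope 2 below x = 5, slope 0.5 above (no noise)
x <- seq(0, 10, by = 0.5)
y <- ifelse(x < 5, 2 * x, 10 + 0.5 * (x - 5))

# degree = 1 gives piecewise-linear basis functions; the knot at
# x = 5 is the point where the two segments join
fit <- lm(y ~ bs(x, degree = 1, knots = 5))

# Because the data are exactly piecewise linear with a knot at 5,
# the spline reproduces them
max(abs(fitted(fit) - y))
```

Choosing the knots is the key modeling decision; here the knot is placed where the slope visibly changes.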
U.Padma Jyothi, CSE Dept, VITB