Predictive Modelling - Final Project Report-Logistic Regression and LDA
PREDICTIVE MODELLING
August ‘21
Date: 27/08/2021
Table of Contents
Table of Figures
1. Executive Summary
2. Introduction
3. Data Details
4. Data Ingestion and EDA - Descriptive Statistics, Duplicate/Null value Check, Uni-/Bi-Variate Analysis, Outlier Check
    4.1 Data Ingestion
        Sample of the Dataset
        Data Info
        Data Shape & Data Types
        Data Description
        Duplicate Value Check
        NULL Value Check
        Inference
    4.2 Univariate Analysis
        Distribution Plots
        Count Plots
        Inference
    4.3 Transforming Categorical variables to Numeric Variables
    4.4 Bivariate Analysis
        Pairplots
        Correlation Heatmap
    4.5 Outlier Check
5. Logistic Regression
    5.1 Preparing for Model Formulation
    5.2 Formulating a Logistic Regression Model on the Training data
6. Logistic Regression: Performance Metrics
    6.1 LR: Model Score
    6.2 LR: Confusion Matrix
    6.3 LR: Classification Report
    6.4 LR: AUC Score
    6.5 LR: ROC Curve
7. Linear Discriminant Analysis
Table of Figures
Table of Tables
1. Executive Summary
A tour and travel agency which deals with holiday package sales has shared a dataset of 872
employees of a company who were offered a holiday package by the agency. Some of these
employees opted for the package while others did not. For each employee, the dataset records
a number of variables, along with whether the employee opted for the holiday package.
The agency wants to predict whether an employee will opt for a holiday package on the
basis of the values of the other attributes. It also wants to understand which of the
variables in the dataset are crucial in determining whether a given employee purchases a
holiday package, so that it can focus its resources on selling the package to such
employees. So, as part of this project report, I will try to explore the various attributes
provided in the dataset and their
2. Introduction
The intent of this exercise is to perform analysis on the holiday package dataset. I will
explore this dataset using descriptive statistics, univariate and bivariate analysis as part
of Exploratory Data Analysis, and then apply Logistic Regression and LDA. The dataset contains
details on 872 employees, and I will try to analyse the various attributes provided about
the employees in the dataset, and determine how they contribute in determining whether an
3. Data Details
The first column contains an index variable, which is simply the serial number of the entry. I
dropped this index column because it carries no information for the model. The data variables
are as follows:
As we can see, the holiday package dataset has 7 variables with each employee having the
same set of characteristics. Here, we can say that Holliday_Package is the dependent or target
feature and the rest of the variables are the independent or predictor variables. Based on the
Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Unnamed: 0         872 non-null    int64
 1   Holliday_Package   872 non-null    object
 2   Salary             872 non-null    int64
 3   age                872 non-null    int64
 4   educ               872 non-null    int64
 5   no_young_children  872 non-null    int64
 6   no_older_children  872 non-null    int64
 7   foreign            872 non-null    object
dtypes: int64(6), object(2)
memory usage: 54.6+ KB
It can be observed that, after dropping the serial number column, the dataset has 872
employee records and 7 variables. Of the 7 variables, the target feature Holliday_Package and
foreign are object type, while the others are integer type.
Data Description
Numeric columns: Salary, age, educ, no_young_children, no_older_children

        Holliday_Package  foreign
unique                 2        2
top                   no       no
Inference
1. Our Target Feature is the Holliday_Package variable. It has a binary value which
equals 0 for employees not opting to purchase the holiday package and 1 for the
2. All other variables except foreign, such as age and educ, are continuous numeric variables
3. Since the target feature Holliday_Package and the variable foreign are object type,
we need to encode them and convert them into categorical variables for creating a
4. The first column contained an index variable, which is simply the serial number of the
Distribution Plots
I plotted the numeric data variables below.
Count Plots
I plotted the categorical data variables below.
Inference
1. Nearly 46% of employees have opted for the Holiday Package
4. The majority of the employees who have children have 1 younger child (kids less than
7 years of age) and/or 1-2 older children (kids more than 7 years of age)
5. The years of education range from 3 to 17, while the major chunk of employees
6. The majority of the employees (more than two-thirds) fall in the $25000 to $55000
salary range.
7. Nearly all the employees fall in the 20-60 years age range.
I encoded the object-type (categorical) variables Holliday_Package and foreign as integer
codes using the pandas .codes attribute.
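The encoding step described above can be sketched as follows. This is a minimal illustration on a few made-up rows, not the notebook code from the project; it assumes the categories are "no" and "yes", as the Data Description output suggests.

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values).
df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no", "yes"],
    "foreign": ["no", "no", "yes", "yes"],
})

for col in ["Holliday_Package", "foreign"]:
    # pd.Categorical sorts the inferred categories, so "no" -> 0 and "yes" -> 1.
    df[col] = pd.Categorical(df[col]).codes

print(df)
```

Because the categories are sorted alphabetically, "yes" ends up as 1, which matches the 0/1 reading of the target used in the inferences.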
Before proceeding with Bivariate Analysis, I transformed the categorical variables to numeric
variables.
Pairplots
Correlation Heatmap
From the Pair plots and the Correlation Heatmap, it can be observed that the variables in the
Holiday Package dataset form separated clusters and are not strongly correlated with one
another. The maximum correlation between two variables is between educ and Salary, and even
that is only 0.33. Also, whatever limited correlation our target feature Holliday_Package has,
As the Box Plots above show, Salary is the variable with the most outliers, which may affect
the efficacy of the regression model I will build. I treated the outliers in the dataset
using bounds based on the 25th and 75th percentiles. After that, I re-checked for outliers
(please see below).
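One common way to treat outliers from the 25th and 75th percentiles is to cap values at the interquartile-range (IQR) fences. The sketch below assumes that approach with the usual 1.5 multiplier; the report does not state the exact rule used, and the helper name cap_outliers_iqr and the salary values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is an assumed choice)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up salaries with one extreme value.
salary = pd.Series([30000.0, 35000.0, 40000.0, 45000.0, 50000.0, 200000.0])
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps all 872 rows in the data, whereas dropping outlier rows would shrink an already small dataset.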
Figure 11: Box Plots for Continuous Variables post Outlier Treatment
5. Logistic Regression
I separated the target feature Holliday_Package from the predictors and stored it on its own.
Then I used the train_test_split function from the sklearn package to split the data randomly
into train and test datasets in the ratio 70:30 (the test data is 30% of the total data).
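A minimal sketch of the split and model fit is shown here on synthetic data with the same number of rows. The column values, random_state=1, and max_iter=1000 are assumptions for illustration, not settings taken from the project notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 872-row dataset (values are random, not real).
rng = np.random.default_rng(0)
n = 872
df = pd.DataFrame({
    "age": rng.integers(20, 61, n),
    "educ": rng.integers(3, 18, n),
    "Holliday_Package": rng.integers(0, 2, n),
})

# Separate predictors and target, then split 70:30.
X = df.drop(columns="Holliday_Package")
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(len(X_train), len(X_test), model.score(X_test, y_test))
```

With 872 rows, a 30% test share gives 262 test rows and 610 training rows, which agrees with the totals of the confusion matrices reported for both models.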
.37)*height + Intercept
1. The Accuracy Score for the Logistic Regression model on Training data is 0.51967
2. The Accuracy Score for the Logistic Regression model on Test data is 0.53053
Training data:
[[294  32]
 [261  23]]
Test data:
[[129  16]
 [107  10]]
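The accuracy scores can be checked directly from the two Logistic Regression confusion matrices above (610 training rows, then 262 test rows). The sketch assumes the scikit-learn layout, with rows as actual classes and columns as predicted classes in the order 0, 1; the function names are illustrative.

```python
def accuracy(cm):
    """Share of all rows on the diagonal (correct predictions)."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

def recall_positive(cm):
    """Share of actual class-1 rows that were predicted as class 1."""
    (tn, fp), (fn, tp) = cm
    return tp / (fn + tp)

lr_train = [[294, 32], [261, 23]]
lr_test = [[129, 16], [107, 10]]

for name, cm in [("train", lr_train), ("test", lr_test)]:
    print(name, round(accuracy(cm), 5), round(recall_positive(cm), 2))
```

Under that layout the model finds only 23 of 284 package buyers in the training data and 10 of 117 in the test data, which is why the accuracy sits close to that of always predicting "no".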
AUC (Train): 0.567
AUC (Test): 0.627
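As a reminder of what the AUC measures, the sketch below computes it on four made-up predictions, once with scikit-learn's roc_auc_score and once by hand as the share of (positive, negative) pairs in which the positive case gets the higher score. The labels and scores are toy values, not model output from this project.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities of class 1 (hypothetical).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_prob)

# Pairwise interpretation: how often a positive outranks a negative.
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
pairs = [(p, q) for p in pos for q in neg]
auc_by_hand = sum(p > q for p, q in pairs) / len(pairs)

print(auc, auc_by_hand)
```

An AUC of 0.5 corresponds to random ranking, so values of 0.567 and 0.627 indicate only weak separation between the two classes.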
I used a separate Jupyter notebook for my Linear Discriminant Analysis of the Holiday
Package data.
function.
into separate vectors for Training set and Test set. Then I used the train_test_split
function from the sklearn package to split the data randomly into train and test
datasets in the ratio 70:30 (the test data is 30% of the total data).
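A minimal sketch of fitting LDA after such a split is given below. It uses synthetic data from make_classification rather than the holiday package data, and the random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data with 872 rows and 6 predictors (not the real data).
X, y = make_classification(
    n_samples=872, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

train_acc = lda.score(X_train, y_train)
test_acc = lda.score(X_test, y_test)
cm_test = confusion_matrix(y_test, lda.predict(X_test))
print(train_acc, test_acc)
print(cm_test)
```

Using the same split ratio for both models keeps their Training and Test scores comparable.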
1. The Accuracy Score for the LDA model on Training data is 0.6721
2. The Accuracy Score for the LDA model on Test data is 0.6412
Training data:
[[252  74]
 [126 158]]
Test data:
[[103  42]
 [ 52  65]]
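The LDA accuracy scores, and the recall for employees who opted for the package, follow from these two matrices. The worked arithmetic below assumes rows are actual classes and columns are predicted classes, in the order 0, 1.

```latex
\begin{align*}
\text{Accuracy}_{\text{train}} &= \frac{252 + 158}{252 + 74 + 126 + 158} = \frac{410}{610} \approx 0.6721, &
\text{Recall}_{\text{train}} &= \frac{158}{126 + 158} = \frac{158}{284} \approx 0.556, \\
\text{Accuracy}_{\text{test}} &= \frac{103 + 65}{103 + 42 + 52 + 65} = \frac{168}{262} \approx 0.6412, &
\text{Recall}_{\text{test}} &= \frac{65}{52 + 65} = \frac{65}{117} \approx 0.556.
\end{align*}
```

Both recall values round to 56%, compared with 8-9% for the Logistic Regression model.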
AUC (Train): 0.742
AUC (Test): 0.703
9. Programming Files
10.1 Insights
1. The model score for the Logistic Regression model on the Training dataset is 51.9% and that for
Recall (Logistic Regression, Train) = 8% | Recall (Logistic Regression, Test) = 9%
3. The AUC for Training data is 56.7% and Test data is 62.7%
4. The Logistic Regression model doesn’t seem to be a good fit and may need improvement
5. The model score for the LDA’s Training dataset is 67.21% and that for Test dataset is
64.12%. As we can see the accuracy score of LDA model is better than that of Logistic
Regression model
Clearly the Classification report for the LDA model is better than that of the Logistic
Regression model
7. The AUC for Training data is 74.2% and Test data is 70.3%, which are quite similar
Again the AUC for the LDA model is better than that of the Logistic Regression model
8. The Logistic Regression model and the LDA model correctly predict whether an employee
opts for the Holiday Package for around 53% and 64% of the employees in the Test data,
respectively. Since the accuracy score of LDA model is
9. In addition, the LDA model correctly identifies about 56% of the employees who opted
for the package (recall), while the Logistic Regression model identifies only 8-9% of
them.
10. From the above I can say that the LDA model is better for the travel agency to use in an
10.2 Recommendations
1. The dataset has outliers in the salary variable. We know that Logistic Regression is a
better predictor when outliers are present. Therefore, it is recommended to treat the
2. As we saw above, if an employee is a foreigner and the employee does not have any
young children (can be seen using Holliday_Package as hue in bivariate plots), the
employees who have older children do not go for the Holiday Package. So, the agency
can devise special promotional programs and discounts to such employees to incentivize
3. Moreover, a lot of employees having higher salary are not purchasing the Holiday
Package (again, can be seen using Holliday_Package as hue in bivariate plots). So the
travel agency can come up with a plan to create more product awareness and introduce
4. The age of the employee is not a material factor in opting for the holiday package, so it
can be ignored.
5. It was observed from the correlation coefficients that the target feature Holliday_Package
has a high negative correlation with no_young_children. So, it would go a long way if the
travel agency can tailor their holiday packages so as to make them more appealing to