Capstone Grp6 PREDICTING INSURANCE RENEWAL PROPENSITY v3
Post Graduate Program in Business Analytics Business Intelligence by Great Lakes Institute of Management
Submitted by
AVK Subrahmanyam
Shivangi Gupta(S.Id-IHMYYY212M)
(Richa Agarwal)
June 2020
ABSTRACT
The objective of this project is to predict the probability that a customer will default on the premium payment, so that the
insurance agent can proactively reach out to the policy holder to follow up for the payment of premium. Premium paid by
the customer is the major revenue source for insurance companies. Default in premium payments results in significant
revenue losses, and hence insurance companies prefer to know upfront which types of customers are likely to default on
premium payments.
Techniques, Tools, Domain
Techniques   Machine Learning using KNN, Logistic Regression, Ensemble Methods, XGBoost
Tools        R, Tableau, Python, MS Excel
Domain       Financial Analytics (Insurance Sector)
CERTIFICATE
This is to certify that the participants AVK Subrahmanyam, Sweta Sahay, Shivangi Gupta, and Gaurav Chakraborty
who are the students of Great Lakes Institute of Management, have successfully completed their project on “Predicting
Insurance Renewal Propensity”.
This project is the record of authentic work carried out by them during the academic year 2019-2020.
_______________
Richa Agarwal
(Mentor)
ACKNOWLEDGEMENT
We would like to convey our sincere gratitude to our mentor, Richa Agarwal. Without her timely guidance and able
mentorship, this project might not have been possible. Her deep understanding of the use case and business intellect
helped in charting the right approach and deploying the appropriate models for data analytics.
We would also like to thank the Great Lakes Institute of Management for giving us an opportunity to work on a project
assigned to them to showcase their students’ capabilities.
We would like to take this opportunity to thank all our faculty members, who have diligently worked to ensure that the
concepts of data science, business analytics and relevance of business context are well embedded in the analytical solution
that we build.
Table of Contents
2. Project Objective
4. Research Approach
7. Bibliography
8. Annexures
List of Tables and Figures
The dataset contains the following information about 79854 policy holders.
Summary Statistics
At a glance, excluding id and renewal, we have 4 floating point features, 1 integer feature, 4 ordinal integer features and
2 categorical text features. For the numeric variables, we have the following statistical summary:
Variable              Min   1st Qu.   Median   Mean    3rd Qu.   Max
no_of_premiums_paid   2     7         10       10.86   14        60
Abbreviations
No abbreviations were used in the entire report.
Executive Summary
This project helps an insurance company build a model to predict the propensity to pay renewal premium and build an
incentive plan for its agents to maximize the net revenue. Available information includes past transactions from the policy
holders along with their demographics. The client has provided aggregated historical transaction data such as the number of
premiums delayed by 3/6/12 months across all the products, the number of premiums paid, the customer sourcing channel and
customer demographics such as age, monthly income and area type. In addition to the information above, the client has
provided the following relationships. Given this information, the model predicts the propensity of renewal payment, which
can support premium collection and the creation of an incentive plan for agents (at policy level) to maximize the net
revenues from these policies.
Chapter 1
Project Introduction
Insurance is an instrument available to individuals and organizations to reduce their exposure to financial risk. It is a
contractual obligation between two parties, wherein one party (the insurer) agrees to pay another party (the insured) an
agreed financial amount subject to the occurrence of an agreed event. For this, the insured pays an amount, known as the
premium, to the insurer in exchange for protection of the agreed financial amount. The contract of insurance is
based on 7 key principles:
1. Utmost Good Faith
2. Insurable Interest
3. Proximate Cause
4. Indemnity
5. Subrogation
6. Contribution
7. Loss Minimization
Premium paid by customers is a major source of revenue for insurance companies. Default in premium payments
results in significant revenue losses, and hence insurance companies put effort into minimizing this leakage in revenue.
Life insurance companies spend heavily on establishing marketing set-ups and pay hefty first year commissions,
which increases the cost of acquiring a new customer. Studies have shown that $1 spent on customer retention
increases profits by more than $5 spent on new customer acquisition.
On the other hand, regular payment of renewal premium indicates customer satisfaction. It also means higher incentives
for the sales force and higher profits for the company, which may result in lower premiums for new policyholders.
Therefore, predicting the payment of renewal premium, which is considered to be an early warning indicator, is a sine qua
non for the insurance company.
Project Objective
An insurance company is interested in predicting the probability that a customer will default on the premium payment. This
will help in directing the agent force to reach out to policy holders in advance to follow up for payment of premium.
This is achieved by identifying the patterns of default from the historical data and predicting default in premium
payment by employing appropriate model(s) from the armoury of machine learning and predictive analytics.
For this project, default in premium indicates that the customer has not paid the renewal premium.
Data Source
The dataset contains the following information about 79854 policy holders.
Chapter 2
Literature Review
Insurance is an instrument available to individuals and organizations to reduce their exposure to financial risk. It is a
contractual obligation between two parties, wherein one party (the insurer) agrees to pay another party (the insured) an
agreed financial amount subject to the occurrence of an agreed event, as per the terms of the contract. For this, the insured
pays an amount, known as the premium, to the insurer in exchange for protection of the agreed financial amount. The
contract of insurance is based on 7 key principles:
1. Utmost Good Faith
2. Insurable Interest
3. Proximate Cause
4. Indemnity
5. Subrogation
6. Contribution
7. Loss Minimization
Insurance companies operate on the proven principle of using the premia paid by many policy holders to compensate the
financial loss suffered by the insured population.
This is popularly known as pooling of funds and sharing of risk, based on the law of large numbers. The collection of
premium (the major source of revenue) and judicious use of revenue to meet expenses are key to the survival and
growth of the industry. Insurance agents play a predominant role in procuring and retaining the premia collection. The
premium paid in the first year is called first year premium, and from the second year onwards it is called renewal premium.
Commissions are paid to agents to procure and retain the collection of premia.
The insurance market can be categorized into life and general (non-life). In the former, the life of the customer or his/her
dependents is the risk covered, whereas in general insurance it could be health, vehicles or casualty.
Key Metrics
To understand the significance of renewal premium, the team looked at the performance metrics used for this purpose.
The two key metrics used in the industry for renewal are:
Persistency Ratio: This is based on the number of policies renewed vis-à-vis the previous year. It is calculated as below:
(Total number of policies renewed / Total number of policies outstanding as at previous year end) * 100
Conservation Ratio: This is based on the total premium amount renewed vis-à-vis the total premium collected in
the previous year. It is calculated as below:
(Total renewal premium collected in the current year / Total premium collected in the previous year) * 100
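As a small illustration of the two formulas above, the following Python sketch computes both ratios. The function names and the input figures are hypothetical and are not taken from the project data.

```python
def persistency_ratio(policies_renewed, policies_outstanding_prev_year_end):
    """Policies renewed as a percentage of policies outstanding at the previous year end."""
    return 100 * policies_renewed / policies_outstanding_prev_year_end


def conservation_ratio(renewal_premium_current_year, total_premium_prev_year):
    """Renewal premium collected this year as a percentage of last year's total premium."""
    return 100 * renewal_premium_current_year / total_premium_prev_year


# Hypothetical figures, for illustration only (not taken from the project data).
print(persistency_ratio(850, 1_000))
print(conservation_ratio(7_200_000, 9_000_000))
```

The first ratio counts policies while the second weighs them by premium amount, so a few lapsed high-premium policies can lower the conservation ratio much more than the persistency ratio.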
All insurance companies strive to keep these two ratios as high as possible. The influencers of the
persistency rate are the three stakeholders in the industry: life insurers, agents, and customers. Strategies are formulated
and implemented across these three influencer classes to keep the ratios at a very high level. For the purpose of this
study, the insurance company aims to influence the customers by providing guidelines to agents.
Chapter 3
Variable Rationalization
In order to study the data better, we performed a preliminary variable reduction at the outset. At this stage, we
reduced the variables on the following criteria:
Redundant Variables
Business relevance
Correlated Variables
Clubbed Variables
Unimportant Variables
On checking the relationship between renewal premium payment and the variables below in EDA, they intuitively appear to
be unimportant. However, their significance can be interpreted only after model validation, so they were retained for the
purpose of running models:
Accommodation
Marital Status
Number of vehicles owned &
No of dependents
Insights
We learn:
Very few people have defaulted on their renewals more than once (i.e. 12 months late payment), although
some have defaulted 3 times.
We observe that in all 3 situations people have defaulted once.
If the count of payments is lower, the chance of renewal is higher.
Residence area (residence_area_type) and sourcing channel (sourcing_channel) do not affect renewal.
Persons who renewed are less likely to pay by cash or credit card than those who did not renew.
Persons who did not renew seem to be younger than those who renewed.
Correlation
1. Age has a medium negative correlation with the percentage of premium paid by cash_credit and a medium positive
correlation with the number of premiums paid.
2. Income is highly positively correlated with premium amount.
3. Number of premiums paid is moderately positively correlated with income.
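The correlations listed above are pairwise measures of linear association. A minimal sketch of the Pearson coefficient is given below, assuming that this is the measure behind the heatmap (the report does not state which coefficient was used). The income and premium values are hypothetical toy numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical toy values, chosen so that premium rises with income.
income = [50_000, 80_000, 120_000, 200_000, 300_000]
premium = [3_300, 5_400, 9_600, 15_000, 24_000]

r = pearson(income, premium)
print(round(r, 3))
```

A value close to +1 or -1 indicates a strong linear relationship. Two predictors that are this strongly correlated carry largely the same information, which matters when interpreting logistic regression coefficients.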
Research Approach
In the subsequent sections, we will create a predictive model based on logistic regression and other machine learning
models to understand the probability of default in premium payment.
Data preparation
Data Pre-processing
First, we need to convert categorical string variables into numbers. Secondly, missing values need to be treated. We
impute missing values with K-Nearest Neighbors. Thirdly, the number of late payments and the percentage of cash/credit
card payment are the main factors determining whether the premium is renewed.
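The first two pre-processing steps above can be sketched as follows. This is a simplified pure-Python illustration, not the code used in the project: the rows, the column order and the helper name `knn_impute` are hypothetical, and a practical implementation would scale the features before computing distances.

```python
from math import sqrt

# Hypothetical rows: (age_years, sourcing_channel, income); None marks a missing value.
rows = [
    (35, "A", 90_000),
    (36, "A", None),
    (52, "C", 250_000),
    (34, "B", 85_000),
    (55, "C", 240_000),
]

# Step 1: convert the categorical string variable into integer codes.
channels = sorted({row[1] for row in rows})
channel_code = {channel: i for i, channel in enumerate(channels)}
encoded = [[age, channel_code[channel], income] for age, channel, income in rows]


def knn_impute(data, col, k=2):
    """Fill None in column `col` with the mean of that column over the k nearest complete rows.

    Assumes the other columns of an incomplete row are all present.
    """
    other_cols = [j for j in range(len(data[0])) if j != col]
    complete = [row for row in data if row[col] is not None]
    filled = []
    for row in data:
        if row[col] is not None:
            filled.append(list(row))
            continue

        def distance(candidate, row=row):
            # Euclidean distance on the columns that are not being imputed.
            return sqrt(sum((row[j] - candidate[j]) ** 2 for j in other_cols))

        nearest = sorted(complete, key=distance)[:k]
        value = sum(c[col] for c in nearest) / len(nearest)
        filled.append(row[:col] + [value] + row[col + 1:])
    return filled


# Step 2: impute the missing income from the two most similar policy holders.
imputed = knn_impute(encoded, col=2, k=2)
print(imputed[1])
```

Treating integer category codes as distances is a crude simplification; one-hot encoding the channel before imputation would avoid implying an order among channels.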
Outlier treatment
Outliers exist in real-world data, and imputation, capping or removal results in data loss. The outliers were
therefore left as they are in the data set.
(Please refer to the graphical representation at the end of the document)
The SMOTE technique has been applied because the given dataset is imbalanced: 93.74% of policy holders renewed the
premium and 6.26% did not. This statistical technique is used to increase the number of minority class cases in a balanced
way. SMOTE is applied in a graded manner to improve the class imbalance in steps, i.e. at 11.7%.
After applying SMOTE, we have balanced the target variable responder class in the training data.
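The core idea of SMOTE is to create synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python; the function name, the toy points and the parameter values are assumptions for illustration, and it is not the library routine used in the project.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding the point itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 minority rows against 20 majority rows.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_count = 20

new_points = smote(minority, n_new=majority_count - len(minority))
print(len(minority) + len(new_points), "minority rows after oversampling")
```

Oversampling is normally restricted to the training split, as described above, so that the test data keeps the original class proportions.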
Variable Transformation
For the purpose of model building and thereafter, age in days was converted into years and then binned.
A Total Late Payments variable was created by combining the late payments of 3 months, 6 months and 12 months.
Binning
Binning is the method used to classify data into categories to smooth out the presence of outliers. These bins
are useful in providing insights into the category or categories where customers might default.
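The transformations described above (age in days to years, binning, and the combined late-payment count) can be sketched as below. The function names and the 10-year bin width are assumptions; the report does not list the exact bin boundaries that were used.

```python
def age_in_years(age_in_days):
    """Convert age in days to completed years (leap days ignored)."""
    return age_in_days // 365


def age_bin(years, width=10):
    """Assign an age to a bin label such as '50-60'. The bin width is an assumption."""
    lower = (years // width) * width
    return f"{lower}-{lower + width}"


def total_late_payments(late_3_to_6_months, late_6_to_12_months, late_over_12_months):
    """Combine the three late-payment counts into one variable."""
    return late_3_to_6_months + late_6_to_12_months + late_over_12_months


years = age_in_years(18_847)
print(years, age_bin(years), total_late_payments(2, 1, 0))  # 51 50-60 3
```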
(Please refer to the graphical representation at the end of the document)
Chapter 4
Models Used
As the objective of this project is to predict whether a customer will default on the premium payment or not, it is a
classification problem.
The dataset contains the target variable “Renewal Premium”, wherein “0” represents that the customer has not renewed the
premium and “1” that the customer has renewed the premium. Therefore, a supervised learning algorithm needs to be applied
for this prediction.
The supervised learning algorithms applied to this classification problem are:
1. Logistic Regression:
Logistic Regression is a classification algorithm that predicts discrete outcomes such as yes/no, true/false or 0/1
by estimating their probability. This model is most useful for understanding the influence of several independent
variables on a single outcome variable. It works very well on linearly separable classes, making use of the odds
ratio and the sigmoid function.
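To make the role of the sigmoid function and the odds ratio concrete, here is a minimal sketch. The two features, the coefficients and the intercept are hypothetical values chosen for illustration; they are not the fitted coefficients of the project's model.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + exp(-z))


def renewal_probability(features, weights, bias):
    """Probability of renewal from a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for (total late payments, number of premiums paid).
weights = [-0.8, 0.05]
bias = 2.5

p = renewal_probability([3, 10], weights, bias)
odds = p / (1 - p)                 # odds of renewal for this policy holder
odds_ratio_late = exp(weights[0])  # factor applied to the odds per extra late payment

print(round(p, 4), round(odds, 4), round(odds_ratio_late, 4))
```

An odds ratio below 1, as for the late-payment coefficient in this toy example, means each additional unit of the variable lowers the odds of renewal.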
Interpretation
Using Logistic Regression, we predicted which policy holders would default on renewal with an accuracy of 89.27% on the
train data, which seems very good.
On the test data, the same model predicted defaulters with an accuracy of 93.62%, which also seems
very good.
Gradient Boosting:
Gradient boosting is a machine learning technique for regression and classification problems, which produces a
prediction model in the form of an ensemble of weak prediction models, typically decision trees. Gradient
boosted decision trees are the state of the art for structured data problems. Two modern algorithms that build
gradient boosted tree models are XGBoost and LightGBM.
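The idea of an ensemble of weak learners can be sketched with one-split decision stumps that are fitted, round by round, to the residuals of the current prediction. This is a deliberately simplified illustration using squared error on a single feature with hypothetical toy data; classification implementations usually optimise log-loss instead, and this is not the XGBoost or LightGBM algorithm.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additive model: start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical toy data: total late payments versus renewal (1 = renewed, 0 = not renewed).
late_payments = [0, 0, 1, 1, 2, 3, 4, 5]
renewed = [1, 1, 1, 1, 1, 0, 0, 0]

model = gradient_boost(late_payments, renewed)
print(round(model(0), 3), round(model(5), 3))
```

Each stump alone is a weak predictor; the learning rate shrinks every correction so that the ensemble improves gradually rather than overfitting in a single step.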
XGBoost
XGBoost, or Extreme Gradient Boosting, is a further improvement of the Gradient Boosting method that uses more
approximations to find the best tree model.
(Please refer to the graphical representation at the end of the document for the model)
In the above sections we have created prediction models using logistic regression, KNN, SVM and XGBoost. Let us look at
the evaluation metrics of these models on the test data and interpret them.
Metric                Logistic Regression   KNN      SVM      XGBoost
Accuracy              0.9362                0.7875   0.7897   0.9355
95% CI                (0.9331, 0.9393)      -        -        -
No Information Rate   0.9375                -        -        -
P-Value [Acc > NIR]   0.8067                -        -        -
Kappa                 0.2887                -        -        -
AUC                   -                     0.6889   0.7381   0.854821
'Positive' Class      0                     -        -        -
Interpretation:
Logistic regression: the accuracy is 93.62% (highest)
KNN: the accuracy is 78.75%
SVM: the accuracy is 78.97%
XGBoost: the accuracy is 93.55%
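The metrics reported above can all be derived from a confusion matrix. The sketch below shows how accuracy, sensitivity, specificity, the No Information Rate and Kappa relate to each other. The counts are hypothetical, since the report does not reproduce the confusion matrices, and the positive class is taken to be 0 (not renewed), as in the reported output.

```python
def classification_metrics(tp, fn, fp, tn):
    """Summary metrics from a 2x2 confusion matrix (positive class = not renewed)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of actual defaulters that were caught
    specificity = tn / (tn + fp)   # share of actual renewers correctly identified
    # No Information Rate: accuracy obtained by always predicting the larger class.
    nir = max(tp + fn, tn + fp) / total
    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "nir": nir, "kappa": kappa}


# Hypothetical counts for 1,000 test policies, 60 of which did not renew.
metrics = classification_metrics(tp=15, fn=45, fp=20, tn=920)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a heavily imbalanced target, accuracy should be read against the No Information Rate. The figures above report an accuracy of 0.9362 against a No Information Rate of 0.9375, so sensitivity, Kappa and AUC are useful complements when comparing the models.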
Chapter 5
Comparing the accuracy, sensitivity and specificity measures of the classification matrices (on test data) of all the
models created so far, we can say that the logistic regression algorithm has worked best for this dataset, although
the bagging and boosting models are close.
However, one must consider the fact that we balanced the target variable responder class using SMOTE. The results
would have been different had we used the dataset as it is.
The objective here is that the insurance company is looking for a practicable model to predict the probability that a
customer will default on the premium payment. This will help in directing the agent force to reach out to policy holders in
advance to follow up for payment of premium.
1. Based on the variable importance of the Logistic Regression model, the insurance company is advised to orient
its agent force to contact policy holders for renewal premium as per the criteria below.
Measure                      Criteria
Age                          1. Between 30-40 years
                             2. Between 50-70 years
Marital Status               Married
Income                       1. Between 50000-100000
                             2. Equal to 300000 and above
Premium                      15,000 and above
Risk score                   94%-96%
Sourcing Channel             C, D & E
Late Payment (3-6 months)    5-10 times
Late Payment (6-12 months)   1. 5-10 times
                             2. 10-15 times
2. Age: The number of defaults is higher between 30-40 years and between 50-70 years of age. As we saw in the EDA,
the mean age of a policyholder making renewal payment is around 51 years (18,847 days).
3. Number of dependents: Policy holders with 4 dependents are a segment to be focussed on for
issuance of new policies.
4. Marital Status: Married policyholders are to be contacted by agents to ensure receipt of renewal
premium.
5. Residence area: Policyholders in urban areas are to be contacted by agents to ensure receipt of renewal premium.
6. Income: Policy holders with income in the ranges of 50,000 to 100,000 and above 300,000 are to be contacted
by agents to ensure receipt of renewal premium.
7. As customers coming from sourcing channels A & B are more regular in payment of premium, business
from them needs encouragement. At the same time, business coming from sourcing channels C, D & E should be
subjected to closer scrutiny at the time of acceptance.
8. The insurance company has to closely watch and ensure that the 3-6 month and 6-12 month late payments of a
policyholder do not go above 5. This will be a proactive step to reduce the need for follow up by agents.
The Logistic Regression model and XGBoost are able to predict renewal premium default with very high accuracy. In this
case, either Logistic Regression or XGBoost can be used for high accuracy prediction. However, the key aspect
is SMOTE for balancing the minority and majority classes, without which our models would not be so accurate.
Logistic Regression and XGBoost are the best models for predicting that a customer may default on the premium payment.
From XGBoost, we focus on the variables ranked highest by feature importance. The probability threshold to be applied
should be chosen based on the business decision.
With regard to implementation, the insurance company is advised to adopt a three-pronged approach to policy holders
for pursuing renewal premium payment.
Based on the model interpretation metrics:
For True Positives, the insurance company needs to adopt multiple modes of follow up, including sending
a person to provide clarifications or to give comfort to the policyholder for renewal. These have to be started
well before the due date of renewal premium payment.
For the mid-category, a medium approach to follow up at a mid-interval time period is fine.
For policyholders who have been regularly paying premium, a reminder by email and SMS would be
fine.
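The three-pronged approach above can be expressed as a simple mapping from the predicted probability of default to a follow-up tier. The thresholds, tier names and policy ids below are hypothetical; the report does not specify cut-off values, which would need to be set by the business.

```python
def follow_up_tier(default_probability, high=0.6, low=0.2):
    """Map a predicted probability of default to one of three follow-up intensities."""
    if default_probability >= high:
        return "intensive"   # multiple modes, including a personal visit, well before the due date
    if default_probability >= low:
        return "medium"      # moderate follow up at a mid-interval
    return "reminder"        # email and SMS reminder only


# Hypothetical policy ids and model outputs.
scores = {"P001": 0.82, "P002": 0.35, "P003": 0.04}
plan = {policy: follow_up_tier(p) for policy, p in scores.items()}
print(plan)
```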
Bibliography
https://www.inc.com/encyclopedia/insurance-pooling.html
https://www.ibef.org/industry/insurance-sector-india.aspx
“The Indian Insurance Industry Report – 2018” by The India Insure Risk Management & Insurance Broking
Services Pvt Ltd., page 48
Graphical Representations
Exploratory Data Analysis
Unimportant Variable
Correlation Heatmap
XGBoost Graphs
Logloss for comparing the test & train dataset for prediction
Annexures