Get ready to learn how to predict credit defaults with R + H2O!
- Data: credit loan applications to a bank.
- Objective: assess the risk of default, prevent bad loans, and save the bank lots of money.
- The best Kagglers got 0.80 AUC with hundreds of man-hours, feature engineering, and combining more data sets.
- We'll get 0.74 AUC in 30 minutes of coding (plus 1.5 hours of explaining).
- Kaggle Competition: Home Credit Default Risk
- Data is large (166MB unzipped, 308K rows, 122 columns).
- We will work with a 20% sample of the data to keep it manageable.
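As a rough illustration of the 20% sampling step, here is a minimal base R sketch. The `applications` data frame and its columns are hypothetical stand-ins, not the actual Kaggle file, and the workshop itself may sample with a different function.

```r
set.seed(123)  # fix the random seed so the sample is reproducible

# Toy stand-in for the loan application data (hypothetical columns and values)
applications <- data.frame(
  id     = 1:1000,
  income = round(runif(1000, 20000, 200000)),
  target = rbinom(1000, size = 1, prob = 0.08)
)

# Draw 20% of the row indices without replacement
n_keep     <- round(0.20 * nrow(applications))
sample_idx <- sample(nrow(applications), size = n_keep)

# Keep only the sampled rows
applications_sampled <- applications[sample_idx, ]

nrow(applications_sampled)
```

Setting the seed first means everyone in the room gets the same rows, which makes results easier to compare.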
The goal of Machine Learning with H2O is to get you experience with:
- The R programming language
- h2o for machine learning
- lime for feature explanation
- recipes for preprocessing
- This 3-hour workshop will teach you some of the latest tools & techniques for Machine Learning in business.
- With that said, you will spend 5% of your time on modeling (machine learning) and 95% of your time:
  - Managing projects
  - Collecting & working with data (manipulating, combining, cleaning)
  - Visualizing information: showing the size of problems and what is likely contributing
  - Communicating results in terms the business cares about
  - Recommending actions that improve the business
- Further, your organization will be keenly aware of what you contribute financially. You need to show them Return on Investment (ROI). They are making an investment in having a data science team. They expect tangible results.
- Important Actions:
  - Attend my talk on the Business Science Problem Framework tomorrow. The BSPF is the essential system that enables driving ROI with data science.
  - Take my DS4B 201-R course. This teaches you a 10-Week Program that has cut data science project time in half for consultants and has progressed data scientists more than any other course they've taken. You will get 20% OFF (expires after the DSGO conference).
pkgs <- c("h2o", "tidyverse", "rsample", "recipes", "lime")
install.packages(pkgs)
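If some of these packages are already installed, a variation like the one below installs only the missing ones. This is a sketch, not part of the original workshop instructions: the helper name `missing_packages` is made up for illustration, and the `repos` URL is an assumption (any CRAN mirror should work).

```r
# Packages the workshop uses
pkgs <- c("h2o", "tidyverse", "rsample", "recipes", "lime")

# Hypothetical helper: return the package names not found in the local library
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)

# Install only what is missing; report when there is nothing to do
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
} else {
  message("All workshop packages are already installed.")
}
```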
Test H2O - You may need the Java Development Kit (JDK)
library(h2o)
h2o.init()
If H2O cannot connect, you probably need to install Java.
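To make a failed connection easier to diagnose, the start-up call can be wrapped in `tryCatch()`. This is only a sketch: the variable name `h2o_status` is made up for illustration, and an empty path from `Sys.which("java")` merely suggests, rather than proves, that Java is missing.

```r
library(h2o)

# Try to start (or connect to) a local H2O cluster.
# On failure, capture the error instead of stopping the script.
h2o_status <- tryCatch(
  {
    h2o.init()
    "connected"
  },
  error = function(e) {
    message("H2O could not start: ", conditionMessage(e))
    # An empty string here means no `java` executable was found on the PATH
    message("Java found at: '", Sys.which("java"), "'")
    "failed"
  }
)

h2o_status
```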
Wait for instructions from Matt.
The URL for the GitHub project is:
https://github.com/business-science/workshop_2018_dsgo
Skip this step if you already have Docker Community Edition installed
Docker Community Edition Installation Instructions
In a terminal / command line, run the following command to download and install the workshop container. This will take a few minutes to load.
docker run -d -p 8787:8787 -v "`pwd`":/home/rstudio/working -e PASSWORD=rstudio -e ROOT=TRUE mdancho/workshop_2018_dsgo
Open your favorite browser (I'll be using Chrome) and enter the following in the address bar.
localhost:8787
Use the following credentials.
- User Name: rstudio
- Password: rstudio
Wait for instructions from Matt.
The URL for the GitHub project is:
https://github.com/business-science/workshop_2018_dsgo
- tidyverse: A meta-package for data wrangling and visualization. Loads dplyr, ggplot2, and a number of other essential packages for working with data. Documentation: https://www.tidyverse.org/
- recipes: A preprocessing package that includes many standard preprocessing steps. Documentation: https://tidymodels.github.io/recipes/
- h2o: A high-performance machine learning library that is scalable and optimized for performance. Documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
  - GLM: Elastic Net (Generalized Linear Model with L1 + L2 Regularization)
  - GBM: Gradient Boosted Machines (Tree-Based + Boosting)
  - Random Forest: Tree-Based + Bagging
  - Deep Learning: Neural Network
  - Automated Machine Learning: Stacked Ensemble, All Models and Best of Family
- lime: A package for explaining black-box models. LIME Tutorial: https://www.business-science.io/business/2018/06/25/lime-local-feature-interpretation.html
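To give a feel for how recipes and h2o fit together, here is a minimal sketch on toy data. Everything about the data is an assumption: the `toy` data frame, its columns, and its random labels are invented for illustration, so the resulting AUC is meaningless. The step names follow recent recipes releases (older versions used `step_meanimpute()`), and the workshop's own pipeline may use different steps and models.

```r
library(recipes)
library(h2o)

set.seed(123)
n <- 500

# Toy stand-in for loan data (hypothetical columns, random labels)
toy <- data.frame(
  income  = round(runif(n, 20000, 200000)),
  housing = factor(sample(c("rent", "own"), n, replace = TRUE)),
  default = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
)
toy$income[sample(n, 25)] <- NA  # inject some missing values

# recipes: impute, scale, and dummy-encode the predictors
rec <- recipe(default ~ ., data = toy)
rec <- step_impute_mean(rec, all_numeric_predictors())
rec <- step_normalize(rec, all_numeric_predictors())
rec <- step_dummy(rec, all_nominal_predictors())
rec <- prep(rec, training = toy)

baked <- bake(rec, new_data = toy)

# h2o: fit an elastic net GLM (alpha = 0.5 mixes the L1 and L2 penalties)
h2o.init()
train_h2o <- as.h2o(as.data.frame(baked))

glm_fit <- h2o.glm(
  x              = setdiff(names(baked), "default"),
  y              = "default",
  training_frame = train_h2o,
  family         = "binomial",
  alpha          = 0.5
)

# Training AUC; expect a value near 0.5 because the labels are random
h2o.auc(h2o.performance(glm_fit))
```

The recipe is prepared once on the training data and then applied with `bake()`, so the same imputation and scaling values can be reused on new data.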