Data Science for Actuaries (ACT6100)

Arthur Charpentier

Aggregation # 5.1 (Ensemble & Parallel)

Fall 2020

https://github.com/freakonometrics/ACT6100/



Learning in parallel or in series?
Here we will build an ensemble of models.



Learning

Consider an i.i.d. sample $\{(x_i, y_i) \in \mathbb{R}^p \times \mathcal{Y}\}$;
a prediction is the construction of a function $f : \mathbb{R}^p \to \mathcal{Y}$.
E.g. $f^\star(x) = \mathbb{E}[Y \mid X = x]$, i.e. $f^\star = \underset{f}{\operatorname{argmin}}\{\mathbb{E}[(Y - f(X))^2]\}$.
Since $f^\star$ is unknown (the distribution of $(X, Y)$ is unknown), we construct $\widehat{f}_n$ as close as possible to $f^\star$.
$\widehat{f}_n$ is consistent for $P$ if
\[
\lim_{n\to\infty} \mathbb{E}\left[\int \big(f^\star(x) - \widehat{f}_n(x)\big)^2 \, dP(x)\right] = 0
\]
and if $\widehat{f}_n$ is consistent for every distribution, we speak of universal consistency.
Universally consistent estimators do exist.



Learning
In classification, the Bayes classifier is
\[
f^\star(x) = \mathbf{1}\big(\mathbb{P}[Y=1\mid X=x] \ge \mathbb{P}[Y=0\mid X=x]\big)
\]
The k-nearest-neighbour method,
\[
\widehat{f}_n(x) = \mathbf{1}\left(\sum_{i=1}^{k} \mathbf{1}(y_{i:x}=1) \ge \sum_{i=1}^{k} \mathbf{1}(y_{i:x}=0)\right)
\]
where $y_{j:x}$ is the label of the $j$-th nearest observation to $x$ among $x_1, \cdots, x_n$, and the kernel method,
\[
\widehat{f}_n(x) = \mathbf{1}\left(\sum_{i=1}^{n} \mathbf{1}(y_i=1)\,\mathbf{1}(\|x - x_i\| \le h) \ge \sum_{i=1}^{n} \mathbf{1}(y_i=0)\,\mathbf{1}(\|x - x_i\| \le h)\right)
\]
are universally consistent classifiers, provided $k \to \infty$ with $k/n \to 0$, and $h \to 0$ with $nh^p \to \infty$, as $n \to \infty$.
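As an illustration, here is a minimal R sketch of the k-nearest-neighbour classifier defined above (the simulated data, the query point and the value of k are arbitrary choices, purely for the example):

# minimal sketch of the k-NN classifier defined above (illustrative only)
knn_classify = function(x0, x, y, k = 5){
  # x: n x p matrix of features, y: vector of 0/1 labels, x0: query point
  d = apply(x, 1, function(xi) sqrt(sum((xi - x0)^2)))  # Euclidean distances to x0
  nearest = order(d)[1:k]                               # indices of the k closest observations
  as.numeric(sum(y[nearest] == 1) >= sum(y[nearest] == 0))
}
# toy example, on simulated data
set.seed(1)
x = matrix(rnorm(200), ncol = 2)
y = as.numeric(x[,1] + x[,2] + rnorm(100, sd = .5) > 0)
knn_classify(c(0.5, 0.5), x, y, k = 7)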
Ensemble Method
Galton (1907, Vox Populi): guessing the weight of an ox at a county fair in Cornwall, England, 1906.
Correct answer = 1198 lbs
787 participants, $x_1, \cdots, x_n$
▶ quantile 25%: 1162 lbs
▶ quantile 50%: 1207 lbs
▶ quantile 75%: 1236 lbs
▶ average: 1197 lbs
Pick a single prediction $x_j$ (at random) vs. the average $\overline{x}$:
\[
\mathbb{E}\big[(x_j - t)^2\big] = (\overline{x} - t)^2 + \frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2
\]
where $t$ is the truth
(ambiguity decomposition)
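A quick numerical check of this decomposition, on simulated guesses (the numbers below are made up for illustration; they are not Galton's data):

# numerical check of the ambiguity decomposition, on simulated guesses
set.seed(42)
t = 1198                                        # the truth
x = round(rnorm(787, mean = 1200, sd = 70))     # hypothetical individual guesses
lhs = mean((x - t)^2)                           # expected loss of a randomly picked guess
rhs = (mean(x) - t)^2 + mean((x - mean(x))^2)   # loss of the average + dispersion of the guesses
c(lhs, rhs)                                     # the two quantities coincide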
Ensemble Method

▶ statistical interpretation: hopefully the ensemble generalises better than a single chosen model
▶ computational interpretation: averaging can sometimes be a faster way of getting close to the optimum than direct optimisation
▶ representational interpretation: averaging models from some model class can sometimes take you outside of that model class
Regression problem: averaging
Classification problem: voting (majority)



Votes
Condorcet (1785, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix)
p: the probability that a single voter is wrong
The probability that the majority is wrong (with n voters) is
\[
\sum_{k \ge \lceil (n+1)/2 \rceil} \binom{n}{k} p^k (1-p)^{n-k}
\]
e.g. with n = 11 and p = 30% or p = 40%, the probability that the majority is wrong is 7.8% and 24.6%, respectively.
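These two values can be checked directly in R (a small sketch; the helper name is ours):

# probability that the majority of n (independent) voters is wrong,
# when each voter is individually wrong with probability p
p_majority_wrong = function(n, p){
  k = ceiling((n + 1) / 2)            # majority threshold
  1 - pbinom(k - 1, size = n, prob = p)
}
p_majority_wrong(11, 0.3)   # ~ 0.078
p_majority_wrong(11, 0.4)   # ~ 0.246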



Votes
Evolution of the probability that the majority is wrong, as a function of n, for various p

but this is valid only if votes are independent!
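A sketch reproducing such curves, under the independence assumption (the grid of values for n and p is arbitrary):

# probability that the majority is wrong, as a function of n, for various p
n = seq(1, 101, by = 2)   # odd numbers of voters
plot(NULL, xlim = range(n), ylim = c(0, .5),
     xlab = "number of voters (n)", ylab = "P(majority is wrong)")
for(p in c(0.2, 0.3, 0.4, 0.45, 0.49)){
  lines(n, 1 - pbinom(ceiling((n + 1)/2) - 1, size = n, prob = p))
}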


In ensemble learning, we want
▶ models to be reasonably∗ accurate
▶ models to be as independent as possible
∗ ensemble learning can turn a collection of poor learners (stumps, i.e. single-node decision trees, or linear models) into a good one



Independence?
With correlation among predictions, there is no real decrease (in variance) from averaging.

▶ split the training sample into m groups of n/m observations
  problem: the groups are small, hence poor predictions
▶ split the features into m groups
  problem: poor predictions if only a small number of (true) predictors
▶ use the bootstrap
  problem: the models may not be independent enough



Bayesian model averaging
Consider m models, and suppose that one of them is the true one, but we do not know which one...
Let T denote the choice of the model, $T \in \{1, 2, \cdots, m\}$,
\[
\mathbb{P}[Y=1\mid X=x] = \sum_{t=1}^{m} \mathbb{P}[Y=1, T=t\mid X=x]
= \sum_{t=1}^{m} \underbrace{\mathbb{P}[Y=1\mid X=x, T=t]}_{\widehat{y}(t)}\;\underbrace{\mathbb{P}[T=t\mid X=x]}_{\omega_t}
\]
which is a weighted average of predictions (or posterior class predictions), where the weights
\[
\omega_t = \mathbb{P}[T=t\mid X=x] \propto \mathbb{P}[X=x\mid T=t]
\]
are the likelihoods of the models (assuming a uniform prior over the m models).
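As a tiny sketch of that formula: given per-model predictions ŷ(t) and posterior model weights ω_t, the averaged prediction is just a weighted sum (the numbers below are made up for illustration; they do not come from the credit example):

# Bayesian model averaging: weighted average of per-model predictions
yhat  = c(glm = 0.62, tree = 0.55, svm = 0.70)   # hypothetical P[Y=1 | X=x, T=t]
omega = c(glm = 0.50, tree = 0.20, svm = 0.30)   # hypothetical weights P[T=t | X=x], summing to 1
sum(omega * yhat)                                # P[Y=1 | X=x] under model averaging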



Stacking

Estimating model likelihoods can be complicated; instead, we can learn the weights using a regression (where the individual model outputs are treated as features).

url = "http://freakonometrics.free.fr/german_credit.csv"
credit = read.csv(url, header = TRUE, sep = ",")
F = c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
for(i in F) credit[,i] = as.factor(credit[,i])

Create training and testing datasets

set.seed(123)
i_test = sample(1:nrow(credit), size = 333)
i_train = (1:nrow(credit))[-i_test]



Credit scoring with ensemble techniques
▶ model 1: logistic regression
GLM = glm(Creditability ~ ., data = credit[i_train,], family = binomial)
pGLM = predict(GLM, newdata = credit[i_test,], type = "response")
classGLM = c(0,1)[1 + (pGLM > .5)*1]

▶ model 2: classification tree
library(rpart)
CART = rpart(Creditability ~ ., data = credit[i_train,])
classCART = predict(CART, newdata = credit[i_test,], type = "class")

▶ model 3: SVM (with a Gaussian kernel)
library(kernlab)
SVM = ksvm(Creditability ~ ., data = credit[i_train,], kernel = "rbfdot", C = 1)
classSVM = predict(SVM, newdata = credit[i_test,])



Credit scoring with ensemble techniques
On the correlation of those 3 models,
> df = data.frame(as.numeric(classCART)-1,
+                 as.numeric(classGLM),
+                 as.numeric(classSVM)-1)
> names(df) = c("CART","GLM","SVM")
> cor(df)
          CART       GLM       SVM
CART 1.0000000 0.3777031 0.4888801
GLM  0.3777031 1.0000000 0.5528032
SVM  0.4888801 0.5528032 1.0000000

Consider a simple (majority based) voting procedure
> vote = ifelse(rowSums(df) > 1.5, 1, 0)

and use accuracy as the classification metric (the helper below is named auc, but it actually returns the accuracy from caret's confusion matrix)
> library(caret)
> auc = function(y) caret::confusionMatrix(data = as.factor(y),
+     reference = credit[i_test,"Creditability"])$overall["Accuracy"]
> y = credit[i_test,"Creditability"]
Credit scoring with ensemble techniques
On the classification tree,
> table(df$CART, y)
     0   1
  0 42  46
  1 52 193
> auc(df$CART)
 Accuracy
0.7057057

On the GLM,
> table(df$GLM, y)
     0   1
  0 43  43
  1 51 196
> auc(df$GLM)
 Accuracy
0.7177177

On the SVM,
> table(df$SVM, y)
     0   1
  0 33  26
  1 61 213
> auc(df$SVM)
 Accuracy
0.7387387

On the majority vote,
> table(vote, y)
vote   0   1
   0  37  29
   1  57 210
> auc(vote)
 Accuracy
0.7417417



Credit scoring with ensemble techniques

> df$CREDIT = as.numeric(credit[i_test,"Creditability"])-1

> reglm = lm(CREDIT ~ CART + GLM + SVM, data = df)
> df$STACKLM = predict(reglm)
> auc((df$STACKLM > .5)*1)
 Accuracy
0.7627628

> reg = glm(CREDIT ~ CART + GLM + SVM, data = df, family = binomial)
> df$STACKGLM = predict(reg, type = "response")
> auc((df$STACKGLM > .5)*1)
 Accuracy
0.7627628

This is the idea of stacking.



Credit scoring with ensemble techniques
Instead of a (majority based) voting procedure, why not consider predictive probabilities?
> pGLM = pGLM   # already a probability (computed with type = "response")
> pCART = predict(CART, newdata = credit[i_test,], type = "prob")[,2]
> SVM = ksvm(Creditability ~ ., data = credit[i_train,], kernel = "rbfdot", C = 1, prob.model = TRUE)
> pSVM = predict(SVM, newdata = credit[i_test,], type = "probabilities")[,2]
> pdf = data.frame(as.numeric(pCART),
+                  as.numeric(pGLM),
+                  as.numeric(pSVM))
> names(pdf) = c("CART","GLM","SVM")
> cor(pdf)
          CART      GLM       SVM
CART 1.0000000 0.516073 0.6101085
GLM  0.5160730 1.000000 0.8641040
SVM  0.6101085 0.864104 1.0000000





Credit scoring with ensemble techniques

One can consider the average (probability) prediction
\[
\widehat{y}_i = \text{average}\big(\widehat{y}_i^{\text{tree}}, \widehat{y}_i^{\text{glm}}, \widehat{y}_i^{\text{svm}}\big) = \frac{1}{3}\big(\widehat{y}_i^{\text{tree}} + \widehat{y}_i^{\text{glm}} + \widehat{y}_i^{\text{svm}}\big)
\]
> auc((apply(pdf, 1, mean) > .5)*1)
 Accuracy
0.7417417

or the median
\[
\widehat{y}_i = \text{median}\big(\widehat{y}_i^{\text{tree}}, \widehat{y}_i^{\text{glm}}, \widehat{y}_i^{\text{svm}}\big)
\]
> auc((apply(pdf, 1, median) > .5)*1)
 Accuracy
0.7477477


Ensemble techniques: the Netflix Prize

2009: US$1,000,000 Netflix Prize, won by BellKor's Pragmatic Chaos, an ensemble of 107 models.
“Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a simple technique [...] We strongly believe that the success of an ensemble approach depends on the ability of its various predictors to expose different complementing aspects of the data. Experience shows that this is very different than optimizing the accuracy of each individual predictor”



(Un)stability of trees

> variable = rep(NA, 10000)
> for(i in 1:10000){
+   arbre = rpart(PRONO ~ ., data = myocarde[sample(1:71, size = 47),])
+   variable[i] = as.character(arbre$frame[1, "var"])
+ }
> table(variable)/100
INCAR INSYS REPUL
24.12 47.72 28.16

The first split of a tree fitted on a subsample of the myocarde dataset is
▶ INSYS (47.7% of the time)
▶ REPUL (28.2%)
▶ INCAR (24.1%)



Principle

▶ Trees are not very robust (in their shape, not necessarily in their predictions).
▶ The approach presented with decision trees therefore yields predictions with a high variance.
▶ Bagging, or bootstrap aggregating, is a family of (meta-)algorithms designed to improve the stability and accuracy of the model (or algorithm) it is applied to. It also reduces variance and helps avoid overfitting.



Bagging Algorithm

1. Generate B different training datasets.
2. From each training dataset, build a model $\widehat{f}_b$, $b = 1, \ldots, B$.
3. For a new observation $x$, the aggregated (bagging) prediction is
\[
\widehat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x).
\]
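A minimal sketch of this algorithm in R, on the credit data and the helpers from the earlier slides (the number of trees, the rpart settings and the restriction to four predictors are our own choices, made to keep the sketch short and to avoid rare factor levels missing from a bootstrap sample):

# minimal bagging sketch: B unpruned trees on bootstrap samples, aggregated by averaging
library(rpart)
set.seed(789)
B = 100
trees = vector("list", B)
for(b in 1:B){
  idx = sample(i_train, size = length(i_train), replace = TRUE)   # step 1: bootstrap sample
  trees[[b]] = rpart(Creditability ~ Credit.Amount + Duration.of.Credit..month. +
                       Age..years. + Account.Balance,
                     data = credit[idx,],
                     control = rpart.control(cp = 0, minsplit = 5)) # step 2: unpruned tree
}
# step 3: aggregate, here by averaging the predicted probabilities of class "1"
p_bag = rowMeans(sapply(trees, function(tr)
  predict(tr, newdata = credit[i_test,], type = "prob")[, "1"]))
auc((p_bag > .5)*1)   # test-set accuracy, using the helper defined earlier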



Step 1
(1) Generating the B datasets.
▶ In practice, we do not have access to B different datasets.
▶ If we split the initial dataset into B independent parts (as in cross-validation), we lose information and the resulting datasets are too small (B has to be large for bagging to be worthwhile).
▶ Instead, we use the bootstrap to generate these B datasets.
▶ Dataset b, b = 1, ..., B, of size n, is obtained by drawing n observations at random, with replacement, from the initial dataset (see the sketch below).
▶ The B datasets are built from the same initial dataset → the resulting predictions $\widehat{f}_b$ are not independent.
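A one-line version of that bootstrap draw, on the credit training set used earlier:

# bootstrap dataset b: n observations drawn with replacement from the initial dataset
idx_b = sample(i_train, size = length(i_train), replace = TRUE)
credit_b = credit[idx_b,]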



Step 2

(2) Building model b, b = 1, ..., B.

▶ In practice, when bagging is used with decision trees, the trees are not pruned.
▶ Each tree b, b = 1, ..., B, therefore has a low bias but a high variance.
▶ The variance reduction takes place at step 3, when the trees are aggregated.



Step 3

(3) Aggregating the models

\[
\begin{aligned}
\operatorname{Var}\big(\widehat{f}_{\text{bag}}(x)\big)
&= \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x)\right)
 = \frac{1}{B^2}\operatorname{Var}\left(\sum_{b=1}^{B}\widehat{f}_b(x)\right) \\
&= \frac{1}{B^2}\sum_{b_1=1}^{B}\sum_{b_2=1}^{B}\operatorname{Cov}\big(\widehat{f}_{b_1}(x),\widehat{f}_{b_2}(x)\big)
\;\le\; \frac{1}{B^2}\,B^2\operatorname{Var}\big(\widehat{f}_b(x)\big) = \operatorname{Var}\big(\widehat{f}_b(x)\big)
\end{aligned}
\]

since, when $b_1 \ne b_2$, $\operatorname{Corr}[\widehat{f}_{b_1}(x), \widehat{f}_{b_2}(x)] \le 1$.



Step 3

(3) Aggregating the models

In fact, if $\operatorname{Var}(\widehat{f}_b(x)) = \sigma^2(x)$ and $\operatorname{Corr}[\widehat{f}_{b_1}(x), \widehat{f}_{b_2}(x)] = r(x)$,
\[
\operatorname{Var}\big(\widehat{f}_{\text{bag}}(x)\big) = r(x)\,\sigma^2(x) + \frac{1-r(x)}{B}\,\sigma^2(x)
\]
The variance is smaller when the aggregated models are more different (less correlated).
The instability of trees makes them good candidates for aggregation.
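A short numerical illustration of this formula (σ²(x) = 1 and the grid of values for r and B are arbitrary): as B grows, the variance decreases towards the floor r(x)σ²(x), which vanishes only if the models are uncorrelated.

# variance of the bagged predictor: r*sigma2 + (1-r)/B * sigma2
sigma2 = 1
B = c(1, 10, 100, 1000)
for(r in c(0, 0.3, 0.7)) print(round(r * sigma2 + (1 - r) / B * sigma2, 4))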



Bagging Algorithm

▶ Empirically, aggregating hundreds or even thousands of trees (B = 100, 1000, 10000, ...) greatly improves the predictive power of decision trees.
▶ On the other hand, the results are harder to interpret: no simple graph, etc.
▶ Software offering this kind of meta-algorithm usually also provides methods to measure the importance of the explanatory variables in the model.



Random Forest

Algorithm 1: Random Forest
initialization: $m \le p$ (number of features considered to split a node), $B$ (number of trees);
for $b = 1, 2, \ldots, B$ do
    generate a bootstrap sample $D_n^b$;
    $\widehat{f}_b \leftarrow$ tree (CART), where each split is obtained by minimizing some cost over a set of $m$ features chosen at random
return
\[
\widehat{f}_{\text{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x)
\]
The smaller $m$, the smaller the correlation between trees.
Classically, $m = p/3$ for regression and $m = \sqrt{p}$ for classification.
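For instance, with the p = 20 explanatory variables of the credit data, these rules give the following values (a quick check; the floors are taken as in the randomForest package):

# default number of candidate features per split, for p = 20 predictors
p = 20
floor(sqrt(p))        # classification: 4 (the value reported by randomForest on the next slide)
max(floor(p / 3), 1)  # regression: 6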



Random Forest

▶ out-of-bag error
Let $\mathcal{B}_i \subset \{1, 2, \cdots, B\}$ be the set of trees such that $i \notin D_n^b$,
\[
\widehat{y}_i = \frac{1}{|\mathcal{B}_i|}\sum_{b\in\mathcal{B}_i}\widehat{f}_b(x_i)
\]
which is used to measure an out-of-bag empirical risk
\[
\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \widehat{y}_i\big)^2
\quad\text{or}\quad
\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big(y_i \ne \widehat{y}_i\big)
\]
▶ variable importance (1)
see the discussion in part 4, on trees



Random Forest
▶ variable importance (2)
Let $I_b$ be the sample associated with $D_b$, and $\overline{I}_b$ the out-of-bag part. On that sample, define the out-of-bag risk
\[
\widehat{R}_b^{\text{oob}} = \frac{1}{|\overline{I}_b|}\sum_{i\in\overline{I}_b}\big(\widehat{f}_b(x_i) - y_i\big)^2
\]
then consider a small perturbation of variable $j$,
\[
\widehat{R}_b^{\text{oob}}(j) = \frac{1}{|\overline{I}_b|}\sum_{i\in\overline{I}_b}\big(\widehat{f}_b(x_{i1},\cdots,x_{ij}+\varepsilon,\cdots,x_{ik}) - y_i\big)^2
\]
and define
\[
\text{importance}_j = \frac{1}{B}\sum_{b=1}^{B}\Big(\widehat{R}_b^{\text{oob}}(j) - \widehat{R}_b^{\text{oob}}\Big)
\]
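As a sketch of this idea, here is a by-hand computation on the credit data, with two simplifications: variable j is permuted on the out-of-bag sample (as the randomForest package does) rather than perturbed by a small ε, and only four predictors are used, to keep the code short and to avoid rare factor levels missing from a bootstrap sample. The number of trees and the seed are arbitrary choices.

# minimal sketch of the importance measure above, on B bagged trees
library(rpart)
set.seed(2020)
B = 50
vars = c("Credit.Amount", "Duration.of.Credit..month.", "Age..years.", "Account.Balance")
oob_gap = matrix(NA, B, length(vars), dimnames = list(NULL, vars))
for(b in 1:B){
  idx = sample(i_train, size = length(i_train), replace = TRUE)   # bootstrap sample
  oob = setdiff(i_train, idx)                                     # out-of-bag part
  tr  = rpart(Creditability ~ Credit.Amount + Duration.of.Credit..month. +
                Age..years. + Account.Balance, data = credit[idx,])
  err = function(d) mean(predict(tr, newdata = d, type = "class") != d$Creditability)
  R_b = err(credit[oob,])                                         # out-of-bag risk of tree b
  for(j in vars){
    d_j = credit[oob,]
    d_j[, j] = sample(d_j[, j])                                   # permute variable j
    oob_gap[b, j] = err(d_j) - R_b                                # increase in oob risk
  }
}
sort(colMeans(oob_gap), decreasing = TRUE)    # importance_j, averaged over the B trees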



Random Forest
> credit = read.csv(url, header = TRUE, sep = ",")
> F = c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
> for(i in F) credit[,i] = as.factor(credit[,i])
> set.seed(123)
> i_test = sample(1:nrow(credit), size = 333)
> i_train = (1:nrow(credit))[-i_test]
> set.seed(456)
> library(randomForest)
> fit = randomForest(Creditability ~ ., data = credit)
> print(fit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 23.3%
Confusion matrix:
    0   1 class.error
0 121 179  0.59666667
1  54 646  0.07714286



Random Forest
> varImpPlot(fit)
> fit$importance
                 MeanDecreaseGini
Account.Balance             45.94
Duration.of.Cre             39.09
Payment.Status.             26.38
Purpose                     36.61
Credit.Amount               50.31
Value.Savings.S             21.62
Length.of.curre             24.42
Instalment.per.             18.66
Sex...Marital.S             14.87
Guarantors                   7.62
Duration.in.Cur             18.48
Most.valuable.a             19.46
Age..years.                 38.45
Concurrent.Cred              9.51
Type.of.apartme             10.01
No.of.Credits.a              8.05
Occupation                  13.18
No.of.dependent              4.86
Random Forest

To visualize the impact of the $x_j$'s on $y$:
> credit = read.csv(url, header = TRUE, sep = ",")
> plot(Creditability ~ Credit.Amount, data = credit)
> plot(Creditability ~ Account.Balance, data = credit)
> plot(Creditability ~ Duration.of.Credit..month., data = credit)
> plot(Creditability ~ Age..years., data = credit)



Random Forest

One can look at the out-of-bag error
> set.seed(456)
> fit = randomForest(Creditability ~ ., data = credit, ntree = 5000, mtry = 4)
> plot(fit$err.rate[, 1])
or with m = 10, or m = 2



Random Forest
To visualize partial dependence plots
> lv = levels(credit$Creditability)
> partialPlot(fit, credit, "Credit.Amount", lv[1])

> set.seed(456)
> fit_train = randomForest(Creditability ~ ., data = credit[i_train,], ntree = 500, mtry = 4)
> print(fit_train)

        OOB estimate of  error rate: 23.54%
Confusion matrix:
   0   1 class.error
0 86 120   0.5825243
1 37 424   0.0802603



Random Forest

Let us look at predictions from our random forest model
y = credit[i_train, "Creditability"]
y_test = predict(fit_train, type = "prob", newdata = credit[i_train,])[,2]
library(pROC)
roc_train = roc(y, y_test, plot = TRUE, col = "blue")
y = credit[i_test, "Creditability"]
y_test = predict(fit_train, type = "prob", newdata = credit[i_test,])[,2]
roc_test = roc(y, y_test, plot = TRUE, col = "red")



Random Forest

Here, the AUC is
> pROC::auc(roc_test)
Area under the curve: 0.7783

to be compared with a (basic) GLM
> glm_train <- glm(Creditability ~ ., data = credit[i_train,], family = binomial)
> y = credit[i_test, "Creditability"]
> y_test = predict(glm_train, type = "response", newdata = credit[i_test,])
> roc_test <- roc(y, y_test, plot = TRUE, col = colr[2], lwd = 3)
> pROC::auc(roc_test)
Area under the curve: 0.72

