Data Science for Actuaries (ACT6100)

Arthur Charpentier

Aggregation # 5.1 (Ensemble & Parallel)

Fall 2020

https://github.com/freakonometrics/ACT6100/



Learning in parallel or in series?
Here we will build an ensemble of models.



Learning

Consider an i.i.d. sample $\{(x_i, y_i) \in \mathbb{R}^p \times \mathcal{Y}\}$;
a prediction is the construction of a function $f : \mathbb{R}^p \to \mathcal{Y}$.
E.g. $f^\star(x) = \mathbb{E}[Y \mid X = x]$, i.e. $f^\star = \underset{f}{\operatorname{argmin}}\{\mathbb{E}[(Y - f(X))^2]\}$.
Since $f^\star$ is unknown (the distribution of $(X, Y)$ is unknown), we construct $\widehat{f}_n$ as close as possible to $f^\star$.
$\widehat{f}_n$ is consistent for $P$ if
\[
\lim_{n\to\infty} \mathbb{E}\left[\int \big(f^\star(x) - \widehat{f}_n(x)\big)^2 \, dP(x)\right] = 0
\]
and if $\widehat{f}_n$ is consistent for every distribution, we speak of universal consistency.
Universally consistent estimators do exist.



Learning
In classification, the Bayes classifier is
\[
f^\star(x) = \mathbf{1}\big(\mathbb{P}[Y=1\mid X=x] \ge \mathbb{P}[Y=0\mid X=x]\big)
\]
The k-nearest-neighbour method,
\[
\widehat{f}_n(x) = \mathbf{1}\left(\sum_{i=1}^{k} \mathbf{1}(y_{i:x}=1) \ge \sum_{i=1}^{k} \mathbf{1}(y_{i:x}=0)\right)
\]
where $y_{j:x}$ is the label of the $j$-th nearest observation to $x$ among $x_1, \cdots, x_n$, and the kernel method,
\[
\widehat{f}_n(x) = \mathbf{1}\left(\sum_{i=1}^{n} \mathbf{1}(y_i=1)\,\mathbf{1}(\|x - x_i\| \le h) \ge \sum_{i=1}^{n} \mathbf{1}(y_i=0)\,\mathbf{1}(\|x - x_i\| \le h)\right)
\]
are universally consistent classifiers, provided $k \to \infty$ with $k/n \to 0$, and $h \to 0$ with $nh^p \to \infty$, as $n \to \infty$.
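As an illustration, here is a minimal R sketch of the k-nearest-neighbour classifier defined above (the simulated data, the query point and the value of k are arbitrary choices, purely for the example):

# minimal sketch of the k-NN classifier defined above (illustrative only)
knn_classify = function(x0, x, y, k = 5){
  # x: n x p matrix of features, y: vector of 0/1 labels, x0: query point
  d = apply(x, 1, function(xi) sqrt(sum((xi - x0)^2)))  # Euclidean distances to x0
  nearest = order(d)[1:k]                               # indices of the k closest observations
  as.numeric(sum(y[nearest] == 1) >= sum(y[nearest] == 0))
}
# toy example, on simulated data
set.seed(1)
x = matrix(rnorm(200), ncol = 2)
y = as.numeric(x[,1] + x[,2] + rnorm(100, sd = .5) > 0)
knn_classify(c(0.5, 0.5), x, y, k = 7)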
Ensemble Method
Galton (1907, Vox Populi): guessing the weight of an ox at a county fair in Cornwall, England, 1906.
Correct answer = 1198 lbs
787 participants, $x_1, \cdots, x_n$
▶ quantile 25%: 1162 lbs
▶ quantile 50%: 1207 lbs
▶ quantile 75%: 1236 lbs
▶ average: 1197 lbs
Pick a single prediction $x_j$ (at random) vs. the average $\overline{x}$:
\[
\mathbb{E}\big[(x_j - t)^2\big] = (\overline{x} - t)^2 + \frac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2
\]
where $t$ is the truth
(ambiguity decomposition)
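A quick numerical check of this decomposition, on simulated guesses (the numbers below are made up for illustration; they are not Galton's data):

# numerical check of the ambiguity decomposition, on simulated guesses
set.seed(42)
t = 1198                                        # the truth
x = round(rnorm(787, mean = 1200, sd = 70))     # hypothetical individual guesses
lhs = mean((x - t)^2)                           # expected loss of a randomly picked guess
rhs = (mean(x) - t)^2 + mean((x - mean(x))^2)   # loss of the average + dispersion of the guesses
c(lhs, rhs)                                     # the two quantities coincide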
Ensemble Method

▶ statistical interpretation: hopefully the ensemble generalises better than a single chosen model
▶ computational interpretation: averaging can sometimes be a faster way of getting close to the optimum than direct optimisation
▶ representational interpretation: averaging models from some model class can sometimes take you outside of that model class
Regression problem: averaging
Classification problem: voting (majority)



Votes
Condorcet (1785, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix)
p: the probability that a single voter is wrong
The probability that the majority is wrong (with n voters) is
\[
\sum_{k \ge \lceil (n+1)/2 \rceil} \binom{n}{k} p^k (1-p)^{n-k}
\]
e.g. with n = 11 and p = 30% or p = 40%, the probability that the majority is wrong is 7.8% and 24.6%, respectively.
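These two values can be checked directly in R (a small sketch; the helper name is ours):

# probability that the majority of n (independent) voters is wrong,
# when each voter is individually wrong with probability p
p_majority_wrong = function(n, p){
  k = ceiling((n + 1) / 2)            # majority threshold
  1 - pbinom(k - 1, size = n, prob = p)
}
p_majority_wrong(11, 0.3)   # ~ 0.078
p_majority_wrong(11, 0.4)   # ~ 0.246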



Votes
Evolution of the probability that the majority is wrong, as a function of n, for various p

but this is valid only if votes are independent!
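A sketch reproducing such curves, under the independence assumption (the grid of values for n and p is arbitrary):

# probability that the majority is wrong, as a function of n, for various p
n = seq(1, 101, by = 2)   # odd numbers of voters
plot(NULL, xlim = range(n), ylim = c(0, .5),
     xlab = "number of voters (n)", ylab = "P(majority is wrong)")
for(p in c(0.2, 0.3, 0.4, 0.45, 0.49)){
  lines(n, 1 - pbinom(ceiling((n + 1)/2) - 1, size = n, prob = p))
}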


In ensemble learning, we want
▶ models to be reasonably∗ accurate
▶ models to be as independent as possible
∗ ensemble learning can turn a collection of poor learners (stumps, i.e. single-node decision trees, or linear models) into a good one



Independence?
With correlation among predictions, there is no real decrease (in variance) from averaging.

▶ split the training sample into m groups of n/m observations
  problem: the groups are small, hence poor predictions
▶ split the features into m groups
  problem: poor predictions if only a small number of (true) predictors
▶ use the bootstrap
  problem: the models may not be independent enough



Bayesian model averaging
Consider m models, and suppose that one of them is the true one, but we do not know which one...
Let T denote the choice of the model, $T \in \{1, 2, \cdots, m\}$,
\[
\mathbb{P}[Y=1\mid X=x] = \sum_{t=1}^{m} \mathbb{P}[Y=1, T=t\mid X=x]
= \sum_{t=1}^{m} \underbrace{\mathbb{P}[Y=1\mid X=x, T=t]}_{\widehat{y}(t)}\;\underbrace{\mathbb{P}[T=t\mid X=x]}_{\omega_t}
\]
which is a weighted average of predictions (or posterior class predictions), where the weights
\[
\omega_t = \mathbb{P}[T=t\mid X=x] \propto \mathbb{P}[X=x\mid T=t]
\]
are the likelihoods of the models (assuming a uniform prior over the m models).
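As a tiny sketch of that formula: given per-model predictions ŷ(t) and posterior model weights ω_t, the averaged prediction is just a weighted sum (the numbers below are made up for illustration; they do not come from the credit example):

# Bayesian model averaging: weighted average of per-model predictions
yhat  = c(glm = 0.62, tree = 0.55, svm = 0.70)   # hypothetical P[Y=1 | X=x, T=t]
omega = c(glm = 0.50, tree = 0.20, svm = 0.30)   # hypothetical weights P[T=t | X=x], summing to 1
sum(omega * yhat)                                # P[Y=1 | X=x] under model averaging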



Stacking

Estimating model likelihoods can be complicated; instead, we can learn the weights using a regression (where the individual model outputs are treated as features).

url = "http://freakonometrics.free.fr/german_credit.csv"
credit = read.csv(url, header = TRUE, sep = ",")
F = c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
for(i in F) credit[,i] = as.factor(credit[,i])

Create training and testing datasets

set.seed(123)
i_test = sample(1:nrow(credit), size = 333)
i_train = (1:nrow(credit))[-i_test]



Credit scoring with ensemble techniques
▶ model 1: logistic regression
GLM = glm(Creditability ~ ., data = credit[i_train,], family = binomial)
pGLM = predict(GLM, newdata = credit[i_test,], type = "response")
classGLM = c(0,1)[1 + (pGLM > .5)*1]

▶ model 2: classification tree
library(rpart)
CART = rpart(Creditability ~ ., data = credit[i_train,])
classCART = predict(CART, newdata = credit[i_test,], type = "class")

▶ model 3: SVM (with a Gaussian kernel)
library(kernlab)
SVM = ksvm(Creditability ~ ., data = credit[i_train,], kernel = "rbfdot", C = 1)
classSVM = predict(SVM, newdata = credit[i_test,])



Credit scoring with ensemble techniques
On the correlation of those 3 models,
> df = data.frame(as.numeric(classCART)-1,
+                 as.numeric(classGLM),
+                 as.numeric(classSVM)-1)
> names(df) = c("CART","GLM","SVM")
> cor(df)
          CART       GLM       SVM
CART 1.0000000 0.3777031 0.4888801
GLM  0.3777031 1.0000000 0.5528032
SVM  0.4888801 0.5528032 1.0000000

Consider a simple (majority based) voting procedure
> vote = ifelse(rowSums(df) > 1.5, 1, 0)

and use accuracy as the classification metric (the helper below is named auc, but it actually returns the accuracy from caret's confusion matrix)
> library(caret)
> auc = function(y) caret::confusionMatrix(data = as.factor(y),
+     reference = credit[i_test,"Creditability"])$overall["Accuracy"]
> y = credit[i_test,"Creditability"]
Credit scoring with ensemble techniques
On the classification tree,
> table(df$CART, y)
     0   1
  0 42  46
  1 52 193
> auc(df$CART)
 Accuracy
0.7057057

On the GLM,
> table(df$GLM, y)
     0   1
  0 43  43
  1 51 196
> auc(df$GLM)
 Accuracy
0.7177177

On the SVM,
> table(df$SVM, y)
     0   1
  0 33  26
  1 61 213
> auc(df$SVM)
 Accuracy
0.7387387

On the majority vote,
> table(vote, y)
vote   0   1
   0  37  29
   1  57 210
> auc(vote)
 Accuracy
0.7417417



Credit scoring with ensemble techniques

> df$CREDIT = as.numeric(credit[i_test,"Creditability"])-1

> reglm = lm(CREDIT ~ CART + GLM + SVM, data = df)
> df$STACKLM = predict(reglm)
> auc((df$STACKLM > .5)*1)
 Accuracy
0.7627628

> reg = glm(CREDIT ~ CART + GLM + SVM, data = df, family = binomial)
> df$STACKGLM = predict(reg, type = "response")
> auc((df$STACKGLM > .5)*1)
 Accuracy
0.7627628

This is the idea of stacking.



Credit scoring with ensemble techniques
Instead of a (majority based) voting procedure, why not consider predictive probabilities?
> pGLM = pGLM   # already a probability (computed with type = "response")
> pCART = predict(CART, newdata = credit[i_test,], type = "prob")[,2]
> SVM = ksvm(Creditability ~ ., data = credit[i_train,], kernel = "rbfdot", C = 1, prob.model = TRUE)
> pSVM = predict(SVM, newdata = credit[i_test,], type = "probabilities")[,2]
> pdf = data.frame(as.numeric(pCART),
+                  as.numeric(pGLM),
+                  as.numeric(pSVM))
> names(pdf) = c("CART","GLM","SVM")
> cor(pdf)
          CART      GLM       SVM
CART 1.0000000 0.516073 0.6101085
GLM  0.5160730 1.000000 0.8641040
SVM  0.6101085 0.864104 1.0000000





Credit scoring with ensemble techniques

One can consider the average (probability) prediction
\[
\widehat{y}_i = \text{average}\big(\widehat{y}_i^{\text{tree}}, \widehat{y}_i^{\text{glm}}, \widehat{y}_i^{\text{svm}}\big) = \frac{1}{3}\big(\widehat{y}_i^{\text{tree}} + \widehat{y}_i^{\text{glm}} + \widehat{y}_i^{\text{svm}}\big)
\]
> auc((apply(pdf, 1, mean) > .5)*1)
 Accuracy
0.7417417

or the median
\[
\widehat{y}_i = \text{median}\big(\widehat{y}_i^{\text{tree}}, \widehat{y}_i^{\text{glm}}, \widehat{y}_i^{\text{svm}}\big)
\]
> auc((apply(pdf, 1, median) > .5)*1)
 Accuracy
0.7477477


Ensemble techniques: the Netflix Prize

2009: US$1,000,000 Netflix Prize, won by BellKor's Pragmatic Chaos, an ensemble of 107 models.
“Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a simple technique [...] We strongly believe that the success of an ensemble approach depends on the ability of its various predictors to expose different complementing aspects of the data. Experience shows that this is very different than optimizing the accuracy of each individual predictor”



(Un)stability of trees

> variable = rep(NA, 10000)
> for(i in 1:10000){
+   arbre = rpart(PRONO ~ ., data = myocarde[sample(1:71, size = 47),])
+   variable[i] = as.character(arbre$frame[1, "var"])
+ }
> table(variable)/100
INCAR INSYS REPUL
24.12 47.72 28.16

The first split of a tree fitted on a subsample of the myocarde dataset is
▶ INSYS (47.7% of the time)
▶ REPUL (28.2%)
▶ INCAR (24.1%)



Principle

▶ Trees are not very robust (in their shape, not necessarily in their predictions).
▶ The approach presented with decision trees therefore yields predictions with a high variance.
▶ Bagging, or bootstrap aggregating, is a family of (meta-)algorithms designed to improve the stability and accuracy of the model (or algorithm) it is applied to. It also reduces variance and helps avoid overfitting.



Bagging Algorithm

1. Generate B different training datasets.
2. From each training dataset, build a model $\widehat{f}_b$, $b = 1, \ldots, B$.
3. For a new observation $x$, the aggregated (bagging) prediction is
\[
\widehat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x).
\]
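A minimal sketch of this algorithm in R, on the credit data and the helpers from the earlier slides (the number of trees, the rpart settings and the restriction to four predictors are our own choices, made to keep the sketch short and to avoid rare factor levels missing from a bootstrap sample):

# minimal bagging sketch: B unpruned trees on bootstrap samples, aggregated by averaging
library(rpart)
set.seed(789)
B = 100
trees = vector("list", B)
for(b in 1:B){
  idx = sample(i_train, size = length(i_train), replace = TRUE)   # step 1: bootstrap sample
  trees[[b]] = rpart(Creditability ~ Credit.Amount + Duration.of.Credit..month. +
                       Age..years. + Account.Balance,
                     data = credit[idx,],
                     control = rpart.control(cp = 0, minsplit = 5)) # step 2: unpruned tree
}
# step 3: aggregate, here by averaging the predicted probabilities of class "1"
p_bag = rowMeans(sapply(trees, function(tr)
  predict(tr, newdata = credit[i_test,], type = "prob")[, "1"]))
auc((p_bag > .5)*1)   # test-set accuracy, using the helper defined earlier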



Step 1
(1) Generating the B datasets.
▶ In practice, we do not have access to B different datasets.
▶ If we split the initial dataset into B independent parts (as in cross-validation), we lose information and the resulting datasets are too small (B has to be large for bagging to be worthwhile).
▶ Instead, we use the bootstrap to generate these B datasets.
▶ Dataset b, b = 1, ..., B, of size n, is obtained by drawing n observations at random, with replacement, from the initial dataset (see the sketch below).
▶ The B datasets are built from the same initial dataset → the resulting predictions $\widehat{f}_b$ are not independent.
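A one-line version of that bootstrap draw, on the credit training set used earlier:

# bootstrap dataset b: n observations drawn with replacement from the initial dataset
idx_b = sample(i_train, size = length(i_train), replace = TRUE)
credit_b = credit[idx_b,]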



Step 2

(2) Building model b, b = 1, ..., B.

▶ In practice, when bagging is used with decision trees, the trees are not pruned.
▶ Each tree b, b = 1, ..., B, therefore has a low bias but a high variance.
▶ The variance reduction takes place at step 3, when the trees are aggregated.



Step 3

(3) Aggregating the models

\[
\begin{aligned}
\operatorname{Var}\big(\widehat{f}_{\text{bag}}(x)\big)
&= \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x)\right)
 = \frac{1}{B^2}\operatorname{Var}\left(\sum_{b=1}^{B}\widehat{f}_b(x)\right) \\
&= \frac{1}{B^2}\sum_{b_1=1}^{B}\sum_{b_2=1}^{B}\operatorname{Cov}\big(\widehat{f}_{b_1}(x),\widehat{f}_{b_2}(x)\big)
\;\le\; \frac{1}{B^2}\,B^2\operatorname{Var}\big(\widehat{f}_b(x)\big) = \operatorname{Var}\big(\widehat{f}_b(x)\big)
\end{aligned}
\]

since, when $b_1 \ne b_2$, $\operatorname{Corr}[\widehat{f}_{b_1}(x), \widehat{f}_{b_2}(x)] \le 1$.



Step 3

(3) Aggregating the models

In fact, if $\operatorname{Var}(\widehat{f}_b(x)) = \sigma^2(x)$ and $\operatorname{Corr}[\widehat{f}_{b_1}(x), \widehat{f}_{b_2}(x)] = r(x)$,
\[
\operatorname{Var}\big(\widehat{f}_{\text{bag}}(x)\big) = r(x)\,\sigma^2(x) + \frac{1-r(x)}{B}\,\sigma^2(x)
\]
The variance is smaller when the aggregated models are more different (less correlated).
The instability of trees makes them good candidates for aggregation.
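A short numerical illustration of this formula (σ²(x) = 1 and the grid of values for r and B are arbitrary): as B grows, the variance decreases towards the floor r(x)σ²(x), which vanishes only if the models are uncorrelated.

# variance of the bagged predictor: r*sigma2 + (1-r)/B * sigma2
sigma2 = 1
B = c(1, 10, 100, 1000)
for(r in c(0, 0.3, 0.7)) print(round(r * sigma2 + (1 - r) / B * sigma2, 4))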



Bagging Algorithm

▶ Empirically, aggregating hundreds or even thousands of trees (B = 100, 1000, 10000, ...) greatly improves the predictive power of decision trees.
▶ On the other hand, the results are harder to interpret: no simple graph, etc.
▶ Software offering this kind of meta-algorithm usually also provides methods to measure the importance of the explanatory variables in the model.



Random Forest

Algorithm 1: Random Forest
initialization: $m \le p$ (number of features considered to split a node), $B$ (number of trees);
for $b = 1, 2, \ldots, B$ do
    generate a bootstrap sample $D_n^b$;
    $\widehat{f}_b \leftarrow$ tree (CART), where each split is obtained by minimizing some cost over a set of $m$ features chosen at random
return
\[
\widehat{f}_{\text{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b(x)
\]
The smaller $m$, the smaller the correlation between trees.
Classically, $m = p/3$ for regression and $m = \sqrt{p}$ for classification.
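For instance, with the p = 20 explanatory variables of the credit data, these rules give the following values (a quick check; the floors are taken as in the randomForest package):

# default number of candidate features per split, for p = 20 predictors
p = 20
floor(sqrt(p))        # classification: 4 (the value reported by randomForest on the next slide)
max(floor(p / 3), 1)  # regression: 6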



Random Forest

▶ out-of-bag error
Let $\mathcal{B}_i \subset \{1, 2, \cdots, B\}$ be the set of trees such that $i \notin D_n^b$,
\[
\widehat{y}_i = \frac{1}{|\mathcal{B}_i|}\sum_{b\in\mathcal{B}_i}\widehat{f}_b(x_i)
\]
which is used to measure an out-of-bag empirical risk
\[
\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \widehat{y}_i\big)^2
\quad\text{or}\quad
\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big(y_i \ne \widehat{y}_i\big)
\]
▶ variable importance (1)
see the discussion in part 4, on trees



Random Forest
▶ variable importance (2)
Let $I_b$ be the sample associated with $D_b$, and $\overline{I}_b$ the out-of-bag part. On that sample, define the out-of-bag risk
\[
\widehat{R}_b^{\text{oob}} = \frac{1}{|\overline{I}_b|}\sum_{i\in\overline{I}_b}\big(\widehat{f}_b(x_i) - y_i\big)^2
\]
then consider a small perturbation of variable $j$,
\[
\widehat{R}_b^{\text{oob}}(j) = \frac{1}{|\overline{I}_b|}\sum_{i\in\overline{I}_b}\big(\widehat{f}_b(x_{i1},\cdots,x_{ij}+\varepsilon,\cdots,x_{ik}) - y_i\big)^2
\]
and define
\[
\text{importance}_j = \frac{1}{B}\sum_{b=1}^{B}\Big(\widehat{R}_b^{\text{oob}}(j) - \widehat{R}_b^{\text{oob}}\Big)
\]
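As a sketch of this idea, here is a by-hand computation on the credit data, with two simplifications: variable j is permuted on the out-of-bag sample (as the randomForest package does) rather than perturbed by a small ε, and only four predictors are used, to keep the code short and to avoid rare factor levels missing from a bootstrap sample. The number of trees and the seed are arbitrary choices.

# minimal sketch of the importance measure above, on B bagged trees
library(rpart)
set.seed(2020)
B = 50
vars = c("Credit.Amount", "Duration.of.Credit..month.", "Age..years.", "Account.Balance")
oob_gap = matrix(NA, B, length(vars), dimnames = list(NULL, vars))
for(b in 1:B){
  idx = sample(i_train, size = length(i_train), replace = TRUE)   # bootstrap sample
  oob = setdiff(i_train, idx)                                     # out-of-bag part
  tr  = rpart(Creditability ~ Credit.Amount + Duration.of.Credit..month. +
                Age..years. + Account.Balance, data = credit[idx,])
  err = function(d) mean(predict(tr, newdata = d, type = "class") != d$Creditability)
  R_b = err(credit[oob,])                                         # out-of-bag risk of tree b
  for(j in vars){
    d_j = credit[oob,]
    d_j[, j] = sample(d_j[, j])                                   # permute variable j
    oob_gap[b, j] = err(d_j) - R_b                                # increase in oob risk
  }
}
sort(colMeans(oob_gap), decreasing = TRUE)    # importance_j, averaged over the B trees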



Random Forest
> credit = read.csv(url, header = TRUE, sep = ",")
> F = c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
> for(i in F) credit[,i] = as.factor(credit[,i])
> set.seed(123)
> i_test = sample(1:nrow(credit), size = 333)
> i_train = (1:nrow(credit))[-i_test]
> set.seed(456)
> library(randomForest)
> fit = randomForest(Creditability ~ ., data = credit)
> print(fit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 23.3%
Confusion matrix:
    0   1 class.error
0 121 179  0.59666667
1  54 646  0.07714286



Random Forest
> varImpPlot(fit)
> fit$importance
                 MeanDecreaseGini
Account.Balance             45.94
Duration.of.Cre             39.09
Payment.Status.             26.38
Purpose                     36.61
Credit.Amount               50.31
Value.Savings.S             21.62
Length.of.curre             24.42
Instalment.per.             18.66
Sex...Marital.S             14.87
Guarantors                   7.62
Duration.in.Cur             18.48
Most.valuable.a             19.46
Age..years.                 38.45
Concurrent.Cred              9.51
Type.of.apartme             10.01
No.of.Credits.a              8.05
Occupation                  13.18
No.of.dependent              4.86
Random Forest

To visualize the impact of the $x_j$'s on $y$:
> credit = read.csv(url, header = TRUE, sep = ",")
> plot(Creditability ~ Credit.Amount, data = credit)
> plot(Creditability ~ Account.Balance, data = credit)
> plot(Creditability ~ Duration.of.Credit..month., data = credit)
> plot(Creditability ~ Age..years., data = credit)



Random Forest

One can look at the out-of-bag error
> set.seed(456)
> fit = randomForest(Creditability ~ ., data = credit, ntree = 5000, mtry = 4)
> plot(fit$err.rate[, 1])
or with m = 10, or m = 2



Random Forest
To visualize partial dependence plots
> lv = levels(credit$Creditability)
> partialPlot(fit, credit, "Credit.Amount", lv[1])

> set.seed(456)
> fit_train = randomForest(Creditability ~ ., data = credit[i_train,], ntree = 500, mtry = 4)
> print(fit_train)

        OOB estimate of  error rate: 23.54%
Confusion matrix:
   0   1 class.error
0 86 120   0.5825243
1 37 424   0.0802603



Random Forest

Let us look at predictions from our random forest model
y = credit[i_train, "Creditability"]
y_test = predict(fit_train, type = "prob", newdata = credit[i_train,])[,2]
library(pROC)
roc_train = roc(y, y_test, plot = TRUE, col = "blue")
y = credit[i_test, "Creditability"]
y_test = predict(fit_train, type = "prob", newdata = credit[i_test,])[,2]
roc_test = roc(y, y_test, plot = TRUE, col = "red")



Random Forest

Here, the AUC is
> pROC::auc(roc_test)
Area under the curve: 0.7783

to be compared with a (basic) GLM
> glm_train <- glm(Creditability ~ ., data = credit[i_train,], family = binomial)
> y = credit[i_test, "Creditability"]
> y_test = predict(glm_train, type = "response", newdata = credit[i_test,])
> roc_test <- roc(y, y_test, plot = TRUE, col = colr[2], lwd = 3)
> pROC::auc(roc_test)
Area under the curve: 0.72

