Abstract
Classification and regression trees (CART) prove to be a true alternative to fully parametric models such as linear models (LM) and generalized linear models (GLM). Although CART suffer from biased variable selection, they are commonly applied to a wide range of topics and used in tree ensembles and random forests because of their simplicity and computation speed. Conditional inference trees and model-based trees, algorithms in which variable selection is handled via fluctuation tests, are known to give more accurate and interpretable results than CART, but at the cost of longer computation times. Using a closed-form maximum likelihood estimator for GLM, this paper proposes a split point procedure based on the explicit likelihood in order to save time when searching for the best split for a given splitting variable. A simulation study for non-Gaussian responses is performed to assess the computational gain when building GLM trees. We also benchmark GLM trees against CART, conditional inference trees and LM trees on simulated and empirical datasets in order to identify situations where GLM trees are efficient. This approach is extended to multiway split trees and log-transformed distributions. Making GLM trees possible through a new split point procedure allows us to investigate the use of GLM in ensemble methods. We propose a numerical comparison of GLM forests against other random forest-type approaches. Our simulation analyses show cases where GLM forests are good challengers to random forests.
Notes
Algorithm 1 assesses parameter instability only if the splitting variable has two or more levels for a categorical variable, or two or more values for a numeric variable. Otherwise, there is nothing to split.
Our specification has no intercept since two dummy variables are introduced. This means that neither the left nor the right sub-sample is taken as a reference. Note that a model with one intercept and one dummy variable is equivalent and would simply lead to rewriting Eq. (6). However, the fitted log-likelihood is identical with or without an intercept, see Corollary 3.1 and Examples 3.1, 3.2 of Brouste et al. (2020).
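As a minimal sketch of this equivalence (a toy Poisson sample, not taken from the paper), the no-intercept model with two dummies and the intercept-plus-one-dummy model return the same fitted log-likelihood:

```r
set.seed(1)
d <- data.frame(y = rpois(100, lambda = 2), left = rep(c(1, 0), each = 50))
d$right <- 1 - d$left
fit_no_int <- glm(y ~ 0 + left + right, family = poisson, data = d)  # two dummies, no intercept
fit_int    <- glm(y ~ left, family = poisson, data = d)              # intercept + one dummy
logLik(fit_no_int)
logLik(fit_int)  # same value; only the parameterization differs
```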
We do not consider an intercept as the fitted log-likelihood is identical with or without intercept, see Brouste et al. (2020, Corollary 3.1).
The p values can be computed by Monte-Carlo simulation and then adjusted with Bonferroni. This fourth option has also been examined, but its results are not reported: the goodness-of-fit is comparable to that of the other ctree options, while the computation time is much longer.
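A minimal sketch of this option (assuming the partykit package; the dataset is only illustrative, and the Bonferroni adjustment described above would be applied on top of the resampled p values):

```r
library(partykit)
## conditional inference tree with Monte-Carlo resampled p values
ct_mc <- ctree(dist ~ speed, data = cars,
               control = ctree_control(testtype = "MonteCarlo", nresample = 9999))
```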
Since the results of explicit GLM Tree and GLM Tree are equal, only those related to explicit GLM Tree are depicted.
We also fit a post-pruned version of the CART model using the one-standard-error criterion, which produces more accurate results for the CART model on the CategIG1 and ContIG1 datasets, but explicit GLM Tree remains better in terms of RMSE.
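A minimal sketch of one-standard-error post-pruning (assuming the rpart package; the dataset is only illustrative):

```r
library(rpart)
fit <- rpart(dist ~ speed, data = cars, method = "anova")
cp_tab <- fit$cptable
i_min  <- which.min(cp_tab[, "xerror"])                         # minimal cross-validated error
one_se <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]       # one-SE threshold
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= one_se)[1], "CP"]  # smallest tree within one SE
fit_1se <- prune(fit, cp = cp_1se)
```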
References
Box, G.E.P., Cox, D.R.: An analysis of transformations. J. R. Stat. Soc. Ser. B (Methodol.) 26(2), 211–252 (1964)
Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC, Boca Raton (1984)
Brouste, A., Dutang, C., Rohmer, T.: Closed form maximum likelihood estimator for generalized linear models in the case of categorical explanatory variables: application to insurance loss modelling. Comput. Stat. 35, 689–724 (2020)
Chambers, J.M., Hastie, T.J.: Statistical Models in S. Chapman and Hall, London (1993)
Ciampi, A.: Generalized regression trees. Comput. Stat. Data Anal. 12(1), 57–78 (1991)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Denuit, M., Hainaut, D., Trufin, J.: Effective Statistical Learning Methods for Actuaries I: GLMs and extensions. Springer Actuarial Lecture Notes. Springer, Berlin (2019)
Fahrmeir, L., Kaufmann, H.: Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Stat. 13, 342–368 (1985)
Farkas, S., Lopez, O., Thomas, M.: Cyber claim analysis using Generalized Pareto regression trees with applications to insurance. Insur. Math. Econ. 98, 92–105 (2021)
Fokkema, M.: Fitting prediction rule ensembles with R Package pre. J. Stat. Softw. 92(1), 1–30 (2020)
Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
Gama, J.: Functional trees. Mach. Learn. 55(3), 219–250 (2004)
Garge, N.R., Bobashev, G., Eggleston, B.: Random forest methodology for model-based recursive partitioning: the mobForest package for R. BMC Bioinform. 14, 125 (2013)
Hothorn, T., Zeileis, A.: partykit: a modular toolkit for recursive partytioning in R. J. Mach. Learn. Res. 16, 3905–3909 (2015)
Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 15(3), 651–674 (2006)
James, G., Witten, D., Hastie, T., Tibshirani, R.: ISLR: data for an introduction to statistical learning with applications in R (2017)
Kass, G.V.: An exploratory technique for investigating large quantities of categorical data. Appl. Stat. 29(2), 119–127 (1980)
Kim, H., Loh, W.-Y.: Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 96(454), 589–604 (2001)
Landwehr, N., Hall, M., Eibe, F.: Logistic model trees. Mach. Learn. 59, 161–205 (2005)
Lawrence, J.: Introduction to Neural Networks: Design, Theory and Applications, 6th edn. California Scientific Software, Nevada City (1994)
Leisch, F., Dimitriadou, E.: mlbench: machine learning benchmark problems (2021)
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
Liu, N.-T., Lin, F.-C., Shih, Y.-S.: Count regression trees. Adv. Data Anal. Classif. (2019)
Loh, W.-Y.: Regression trees with unbiased variable selection and interaction detection. Stat. Sin. 12(2), 361–386 (2002)
Loh, W.-Y.: Fifty years of classification and regression trees. Int. Stat. Rev. 82(3), 329–348 (2014)
Loh, W.-Y., Shih, Y.-S.: Split selection methods for classification trees. Stat. Sin. 7(4), 815–840 (1997)
Loh, W.-Y., Vanichsetakul, N.: Tree-structured classification via generalized discriminant analysis. J. Am. Stat. Assoc. 83(403), 715–725 (1988)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monographs on Statistics and Applied Probability, vol. 37, 2nd edn. CRC, Boca Raton (1989)
Philipp, M., Rusch, T., Hornik, K., Strobl, C.: Measuring the stability of results from supervised statistical learning. J. Comput. Graph. Stat. 27(4), 685–700 (2018)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2021)
Rusch, T., Zeileis, A.: Gaining insight with recursive partitioning of generalized linear models. J. Stat. Comput. Simul. 83(7), 1301–1315 (2013)
Seber, G.A.F., Lee, A.J.: Linear Regression Analysis. Wiley, Hoboken (2003)
Seibold, H., Hothorn, T., Zeileis, A.: Generalised linear model trees with global additive effects. Adv. Data Anal. Classif. (2018)
Su, X., Wang, M., Fan, J.: Maximum likelihood regression trees. J. Comput. Graph. Stat. 13(3), 586–598 (2004)
Szöcs, E., Schäfer, R.B.: Ecotoxicology is not normal: a comparison of statistical approaches for analysis of count and proportion data in ecotoxicology. Environ. Sci. Pollut. Res. 22(18), 13990–13999 (2015)
Therneau, T., Atkinson, B.: rpart: recursive partitioning and regression trees (2019)
Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S. Springer, Berlin (2002)
Weisberg, S.: Applied Linear Regression. Wiley, Hoboken (2005)
Wilson, K., Grenfell, B.T.: Generalized linear modelling for parasitologists. Parasitol. Today 13(1), 33–38 (1997)
Wood, S.N.: Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73(1), 3–36 (2011)
Zeileis, A., Hornik, K.: Generalized M-fluctuation tests for parameter instability. Stat. Neerl. 61(4), 488–508 (2007)
Zeileis, A., Hothorn, T., Hornik, K.: Model-based recursive partitioning. J. Comput. Graph. Stat. 17(2), 492–514 (2008)
Acknowledgements
The authors are also very grateful for the useful suggestions of the two anonymous referees, which led to significant improvements of this article. The remaining errors, of course, should be attributed to the authors alone. This paper also benefited from fruitful discussions with members of the French chair DIALog – Digital Insurance And Long-term risks – under the aegis of the Fondation du Risque, a joint initiative by UCBL and CNP; and with members of the French laboratory SAF (UCBL and ISFA).
Appendices
Appendix A Computation details on likelihood and deviance
A.1 Proof of the fitted log-likelihood
Using Corollary 3.1 of Brouste et al. (2020) for the j-th partitioning variable and the notation of Table 2, we have
where \({\tilde{b}}=(b^\prime )^{-1}\) is the inverse of \(b^\prime \).
A.2 Proof of the fitted deviance
Let us denote the fitted mean by \({\hat{\mu }}_i=g^{-1}(\langle z_i,\widehat{{\varvec{\theta }}(s)}\rangle )={\overline{y}}_n^{(j)}\). The deviance is given by
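For reference, the underlying textbook identity (McCullagh and Nelder 1989), stated here in generic exponential-family notation rather than as the paper's exact display, is \(D = 2\sum _{i=1}^n w_i\big [ y_i\big ({\tilde{b}}(y_i) - {\hat{\theta }}_i\big ) - b\big ({\tilde{b}}(y_i)\big ) + b\big ({\hat{\theta }}_i\big )\big ]\) with \({\hat{\theta }}_i={\tilde{b}}({\hat{\mu }}_i)\), up to the dispersion convention.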
A.3 Proof of the fitted log-likelihood with weights
We adapt the proof of Brouste et al. (2020) in order to take weights into account when maximizing the log-likelihood. For known weights \(w_i\), their Equations (4) and (5) become
We adapt their proof of Theorem 3.1. The system \(S({\varvec{\theta }})=0\) is
The first equation being redundant, the second equation simplifies to \(\forall j\in J\)
Let us introduce the weighted frequencies \( m_w^{(j)} = \sum _{i=1}^nx_i^{(2),j}w_i\) and the weighted averages \({\overline{y}}_w^{(j)}= \frac{1}{m_w^{(j)}}\sum _{i=1}^nx_i^{(2),j}w_iy_i \). Hence, if \(y_i\) takes values in \(\mathbb {Y}\subset b'(\Lambda )\) and \(\ell \) is injective, we have for all \(j\in J\)
The rest of the proof is identical, except that \(\overline{y}_n^{(j)}\) is replaced by \(\overline{y}_w^{(j)}\). Hence, a slight modification of their Theorem 3.1 gives, for \(j\in J\),
For the two-variable case, we can also adapt their Theorem 3.2 by changing absolute frequencies and averages to weighted frequencies and averages as below.
Frequency | Mean | Index |
---|---|---|
\(m_k^{(2)} = \sum \limits _{i=1}^n x_i^{(2),k}w_i \) | \({\overline{y}}_w^{(2),k} =\frac{1}{m_k^{(2)}} \sum \limits _{i=1}^n y_i x_i^{(2),k}w_i\) | \(k\in K\) |
\(m_l^{(3)} = \sum \limits _{i=1}^n x_i^{(3),l}w_i\) | \({\overline{y}}_w^{(3),l} =\frac{1}{m_l^{(3)}} \sum \limits _{i=1}^n y_i x_i^{(3),l}w_i\) | \(l\in L\) |
\(m_{k,l} = \sum \limits _{i=1}^n x_i^{(k,l)}w_i\) | \({\overline{y}}_w^{(k,l)}=\frac{1}{m_{k,l}}\sum \limits _{i=1}^{n} y_i x_i^{(k,l)}w_i\) | \((k,l)\in K\times L\) |
A.4 Proof of the log-likelihood of the transformed variable
For \(t(x) = \log (d_1 x+d_2)\), with \(t'(x) = \frac{d_1}{d_1x+d_2}\) and \(t^{-1}(x)=(e^x-d_2)/d_1\), the log-likelihood of \(y\) such that \(t(y)\) follows (1) is given by
Therefore, the fitted log-likelihood based on (15) is
where \(t_{j,w}^{l}(s)\) are the corresponding observations of the random variables \(T_{j,w}^{l}(s)\) defined in Table 4.
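For completeness, the change-of-variables identity behind this computation (a standard fact, stated here in generic notation rather than as the paper's display) is: for \(d_1>0\), if \(t(Y)\) has density \(f\), then \(Y\) has density \(f(t(y))\,t'(y)\), so the log-likelihood of \(y\) equals that of \(t(y)\) plus \(\log t'(y) = \log d_1 - \log (d_1 y + d_2)\).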
Appendix B Additional materials for numerical illustrations
B.1 Simon Wood's test functions
We define the smooth functions for Simon Wood's test datasets (Wood 2011) as
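As an illustration only (an assumption, not the paper's display), the smooth functions commonly associated with these test datasets are the Gu–Wahba type functions implemented in mgcv::gamSim:

```r
## standard gamSim-style smooths; whether they match the paper's exact choice is not confirmed here
f0 <- function(x) 2 * sin(pi * x)
f1 <- function(x) exp(2 * x)
f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
f3 <- function(x) 0 * x
```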
B.2 Runtime comparison between GLM tree and explicit GLM tree
See Fig. 6.
Cite this article
Dutang, C., Guibert, Q. An explicit split point procedure in model-based trees allowing for a quick fitting of GLM trees and GLM forests. Stat Comput 32, 6 (2022). https://doi.org/10.1007/s11222-021-10059-x