Abstract
TF Boosted Trees (TFBT) is a new open-sourced framework for the distributed training of gradient boosted trees. It is based on TensorFlow, and its distinguishing features include a novel architecture, automatic loss differentiation, layer-by-layer boosting that results in smaller ensembles and faster prediction, principled multi-class handling, and a number of regularization techniques to prevent overfitting.
1 Introduction
Gradient boosted trees are popular machine learning models. Since their introduction in [3], they have gone on to dominate many competitions on real-world data, including Kaggle and the KDDCup [2]. In addition to their excellent accuracy, they are also easy to use, as they deal well with unnormalized, collinear, missing, or outlier-infected data. They can support custom loss functions and are often easier to interpret than neural nets or large linear models. Because of their popularity, there are now many gradient boosted tree implementations, including scikit-learn [7], R gbm [8], Spark MLlib [5], LightGBM [6], and XGBoost [2].
In this paper, we introduce another optimized and scalable gradient boosted tree library, TF Boosted Trees (TFBT), which is built on top of the TensorFlow framework [1]. TFBT incorporates a number of novel algorithmic improvements to the gradient boosting algorithm, including a new per-layer boosting procedure that offers improved performance on some problems. TFBT is open source and available in the main TensorFlow distribution under contrib/boosted_trees.
2 TFBT Features
In Table 1 we provide a brief comparison between TFBT and some existing libraries. Additionally, TFBT provides the following.
Layer-by-Layer Boosting. TFBT supports two modes of tree building: standard (building a sequence of boosted trees in a stochastic gradient fashion) and a novel layer-by-layer boosting, which allows for stronger trees (leading to faster convergence) and deeper models. One weakness of tree-based methods is that only the examples falling into a given partition are used to produce the estimator associated with that leaf, so deeper nodes use statistics calculated from fewer examples. We overcome that limitation by recalculating the gradients and Hessians whenever a new layer is built, resulting in stronger trees that better approximate the functional-space gradient. This lets deeper nodes use the higher-level splits as priors, so each new layer has more information and can better adjust for errors from the previous layers. Empirically, we found that layer-by-layer boosting generally leads to faster convergence and, with proper regularization, to less overfitting for deeper trees.
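The contrast between the two modes can be sketched as follows. The snippet below is a simplified, single-feature NumPy illustration, not the TFBT implementation: it uses squared loss, median splits, and Newton-step leaf values, and all helper names (grad_hess, grow_layer, boost) are invented for this example. Its only point is where the gradients and Hessians are (re)computed: once per tree in standard mode versus before every layer in layer-by-layer mode.

```python
import numpy as np

LAMBDA = 1.0  # L2 regularization on leaf weights


def grad_hess(preds, labels):
    # Squared loss: g = pred - label, h = 1.
    return preds - labels, np.ones_like(preds)


def newton_leaf(g, h):
    return -g.sum() / (h.sum() + LAMBDA)


def grow_layer(x, g, h, leaf_ids):
    """Split every current leaf at the median feature value of its examples."""
    new_leaf_ids = leaf_ids.copy()
    leaf_values = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        thr = np.median(x[mask])
        left, right = mask & (x <= thr), mask & (x > thr)
        new_leaf_ids[left], new_leaf_ids[right] = 2 * leaf, 2 * leaf + 1
        leaf_values[2 * leaf] = newton_leaf(g[left], h[left])
        leaf_values[2 * leaf + 1] = newton_leaf(g[right], h[right])
    return new_leaf_ids, leaf_values


def boost(x, y, num_trees=3, depth=3, layer_by_layer=True, lr=0.3):
    preds = np.zeros_like(y)
    for _ in range(num_trees):
        g, h = grad_hess(preds, y)              # standard mode: once per tree
        leaf_ids = np.ones(len(x), dtype=int)
        for _ in range(depth):
            if layer_by_layer:
                g, h = grad_hess(preds, y)      # layer-by-layer: before every layer
            leaf_ids, leaf_values = grow_layer(x, g, h, leaf_ids)
            if layer_by_layer:
                # Each layer's output is applied immediately, so the next layer's
                # gradients already account for the higher-level splits.
                preds = preds + lr * np.array([leaf_values[i] for i in leaf_ids])
        if not layer_by_layer:
            preds = preds + lr * np.array([leaf_values[i] for i in leaf_ids])
    return preds


rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.sin(6 * x)
for mode in (True, False):
    mse = np.mean((boost(x, y, layer_by_layer=mode) - y) ** 2)
    print("layer-by-layer" if mode else "standard      ", "MSE:", mse)
```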
Multiclass Support. TFBT supports one-vs-rest, as well as two variations that reduce the number of required trees by storing per-class scores at each leaf. All other implementations use one-vs-rest (MLlib has no multiclass support).
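The trade-off can be illustrated with a small sketch. The snippet below is not the TFBT data structure; it is a toy Python illustration in which a tree is just a (routing function, leaf scores) pair, and all names are invented for the example. With K classes, one-vs-rest grows K scalar-leaf trees per boosting iteration, whereas storing a K-vector of scores in each leaf needs only one tree.

```python
import numpy as np

NUM_CLASSES = 3


def predict_one_vs_rest(ovr_ensembles, features):
    # One ensemble per class; each leaf holds a single scalar score.
    logits = np.zeros(NUM_CLASSES)
    for k, trees in enumerate(ovr_ensembles):
        for route_to_leaf, leaf_scores in trees:
            logits[k] += leaf_scores[route_to_leaf(features)]
    return logits


def predict_vector_leaves(trees, features):
    # A single ensemble whose leaves hold a vector with one score per class.
    logits = np.zeros(NUM_CLASSES)
    for route_to_leaf, leaf_scores in trees:
        logits += leaf_scores[route_to_leaf(features)]
    return logits


# Toy usage: one depth-1 tree splitting on feature 0 at threshold 0.5.
route = lambda features: int(features[0] > 0.5)
vector_tree = (route, np.array([[0.2, -0.1, -0.1],    # leaf 0: scores for 3 classes
                                [-0.3, 0.4, -0.1]]))  # leaf 1
scalar_trees = [[(route, np.array([0.2, -0.3]))] for _ in range(NUM_CLASSES)]

print(predict_vector_leaves([vector_tree], np.array([0.7, 1.0])))  # 1 tree total
print(predict_one_vs_rest(scalar_trees, np.array([0.7, 1.0])))     # K trees total
```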
Since TFBT is implemented in TensorFlow, TensorFlow-specific features are also available:
- Ease of writing custom loss functions, since TensorFlow provides automatic differentiation [1] (other packages, like XGBoost, require the user to provide the first- and second-order derivatives); see the sketch after this list.
- Ability to easily switch between and compare TFBT and other TensorFlow models.
- Ease of debugging with TensorBoard.
- Models can be run on multiple CPUs/GPUs and on multiple platforms, including mobile, and can be easily deployed via TF serving.
- Checkpointing for fault tolerance, incremental training, and warm restart.
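As a minimal sketch of the automatic-differentiation point above: a boosting library needs per-example first and second derivatives of the loss with respect to the current predictions, and TensorFlow can compute both from the loss definition alone. The snippet uses present-day TensorFlow (tf.GradientTape); the contrib-era TFBT code predates this API, so treat it as an illustration of the idea rather than the TFBT interface. The logistic loss and helper names are only examples.

```python
import tensorflow as tf


def custom_loss(predictions, labels):
    # Any twice-differentiable per-example loss works; here, logistic loss.
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=predictions)


def grad_and_hess(predictions, labels):
    with tf.GradientTape() as outer:
        outer.watch(predictions)
        with tf.GradientTape() as inner:
            inner.watch(predictions)
            loss = custom_loss(predictions, labels)
        grad = inner.gradient(loss, predictions)   # per-example first derivative
    hess = outer.gradient(grad, predictions)       # diagonal of the Hessian
    return grad, hess


preds = tf.constant([0.0, 1.5, -2.0])
labels = tf.constant([1.0, 0.0, 1.0])
g, h = grad_and_hess(preds, labels)
print(g.numpy(), h.numpy())
```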
3 TFBT System Design
Finding Splits. One of the most computationally intensive parts of boosting is finding the best splits. Both R and scikit-learn use an exact greedy algorithm that enumerates all possible splits for all features, which does not scale. Other implementations, like XGBoost, use approximate algorithms that build quantiles of the feature values and aggregate gradients and Hessians for each quantile bucket. For the aggregation, two approaches can be used [4]: either each worker works on all the features and the statistics are aggregated in a Map-Reduce (MLlib) or All-Reduce (XGBoost) fashion, or a parameter server (PS) approach is applied (TencentBoost [4], PSMART [9]), in which each worker and PS aggregates statistics only for a subset of the features. The All-Reduce versions do not scale to high-dimensional data, and the Map-Reduce versions are slow to scale.
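The per-feature split search over quantile buckets can be sketched as follows. This is a simplified single-feature, single-node illustration, not TFBT code: the per-bucket gradient and Hessian sums are assumed to be already aggregated, the gain formula is the standard Newton-boosting (XGBoost-style) score with an L2 regularizer on the leaf weights, and all names are invented for the example.

```python
import numpy as np

LAMBDA = 1.0  # L2 regularization on leaf weights


def leaf_score(g_sum, h_sum):
    # Newton-boosting score of a leaf with the given accumulated sums.
    return g_sum ** 2 / (h_sum + LAMBDA)


def best_split_from_histogram(boundaries, bucket_grad, bucket_hess):
    """Pick the quantile boundary with the highest gain from per-bucket sums."""
    total_g, total_h = bucket_grad.sum(), bucket_hess.sum()
    parent = leaf_score(total_g, total_h)
    best_gain, best_threshold = -np.inf, None
    g_left = h_left = 0.0
    for i in range(len(boundaries) - 1):      # candidate split after bucket i
        g_left += bucket_grad[i]
        h_left += bucket_hess[i]
        gain = (leaf_score(g_left, h_left)
                + leaf_score(total_g - g_left, total_h - h_left)
                - parent)
        if gain > best_gain:
            best_gain, best_threshold = gain, boundaries[i]
    return best_threshold, best_gain


# Toy usage: 4 quantile buckets with already-aggregated statistics.
boundaries = np.array([0.1, 0.4, 0.7, 0.9])
bucket_grad = np.array([-3.0, -1.0, 2.0, 4.0])
bucket_hess = np.array([5.0, 4.0, 6.0, 5.0])
print(best_split_from_histogram(boundaries, bucket_grad, bucket_hess))
```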
TFBT Architecture. Our computation model is based on the following needs:
1. Ability to train on datasets that don't fit in the workers' memory.
2. Ability to train deeper trees with a larger number of features.
3. Support for different modes of building the trees: the standard one-tree-per-batch mode, as well as boosting the tree layer-by-layer.
4. Minimizing parallelization costs. Low-cost restarts on stateless workers would allow us to use much cheaper preemptible VMs.
Our design is similar to XGBoost [2] and TencentBoost [4] in that we build distributed quantile sketches of feature values and use them to build histograms, which are later used to find the best splits. In TencentBoost [4] and PSMART [9], the full training data is partitioned and loaded into the workers' memory, which can be a problem for larger datasets. To address this, we instead work on mini-batches, updating the quantiles in an online fashion without loading all the data into memory. As far as we know, this approach is not implemented anywhere else.
Each worker loads a mini-batch of data, builds a local quantile sketch, pushes it to the PS, and fetches the bucket boundaries that were built in the previous iteration (Fig. 1). The workers then compute per-bucket gradients and Hessians and push them back to the PS. One of the workers, designated as the Chief, checks during each iteration whether the PS have accumulated enough statistics for the current layer and, if so, starts building the new layer by finding the best split for each of the nodes in the layer. The code that finds the best splits for a feature is executed on the PS that accumulated the gradient statistics for that feature. The Chief receives the best split for every leaf from the PS and grows a new layer of the tree.
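The per-iteration worker loop described above can be sketched as follows. The snippet is a toy, single-feature, single-leaf Python illustration with squared loss: the ToyPS class and its push/fetch methods are invented stand-ins for the parameter servers (they are not the TFBT or TensorFlow API), and the "sketch" here is just a pooled sample rather than a real distributed quantile sketch.

```python
import numpy as np


class ToyPS:
    """Invented stand-in for the parameter servers (single feature, single leaf)."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        self.sample_pool = np.array([])      # crude "sketch": pooled feature samples
        self.boundaries = np.array([0.0])    # boundaries from a previous iteration
        self.grad_hist = np.zeros(num_buckets + 1)
        self.hess_hist = np.zeros(num_buckets + 1)

    def push_quantile_sketch(self, samples):
        self.sample_pool = np.concatenate([self.sample_pool, samples])

    def fetch_bucket_boundaries(self):
        return self.boundaries

    def rebuild_boundaries(self):
        # Done on the PS between iterations from the merged worker sketches.
        qs = np.linspace(0.0, 1.0, self.num_buckets + 1)[1:-1]
        self.boundaries = np.quantile(self.sample_pool, qs)

    def push_gradient_statistics(self, bucket_ids, grad, hess):
        np.add.at(self.grad_hist, bucket_ids, grad)
        np.add.at(self.hess_hist, bucket_ids, hess)


def worker_iteration(ps, batch_x, batch_y, predictions):
    # 1. Update a local sketch from this mini-batch only and push it to the PS.
    ps.push_quantile_sketch(batch_x)
    # 2. Fetch the bucket boundaries built from the previous iteration's sketches.
    boundaries = ps.fetch_bucket_boundaries()
    # 3. Compute per-bucket gradient/Hessian sums for the current model (squared loss).
    grad, hess = predictions - batch_y, np.ones_like(batch_y)
    bucket_ids = np.searchsorted(boundaries, batch_x)
    # 4. Push the aggregated statistics; the Chief later reads them to grow a layer.
    ps.push_gradient_statistics(bucket_ids, grad, hess)


# Toy usage: two mini-batches against a constant-zero initial model.
ps = ToyPS()
rng = np.random.default_rng(0)
for _ in range(2):
    x = rng.uniform(size=32)
    worker_iteration(ps, x, np.sin(6 * x), np.zeros(32))
ps.rebuild_boundaries()
print(ps.boundaries, ps.grad_hist)
```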
Once the Chief adds a new layer, both the gradients and the quantiles become stale. To avoid stale updates, we introduce an abstraction called StampedResource, a TensorFlow resource with an int64 stamp. The tree ensemble, as well as the gradient and quantile accumulators, are all stamped resources carrying such a token. When a worker fetches the model, it gets the stamp token, which is then used for all reads and writes to stamped resources until the end of the iteration. This guarantees that all updates are consistent and ensures that the Chief does not need to wait for the workers to synchronize, which is important when using preemptible VMs. The Chief checkpoints the resources to disk, and the workers hold no state, so if a worker is restarted, it can load a new mini-batch and continue.
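The stamp-token mechanism can be sketched as follows. The class below is an invented illustration of the idea, not the TensorFlow resource implementation: every read and write carries the stamp fetched with the model, and updates made against an old stamp are dropped after the Chief bumps the stamp when growing a layer.

```python
class StampedAccumulator:
    """Toy illustration of a stamped resource; not the TensorFlow implementation."""

    def __init__(self):
        self.stamp = 0
        self.values = {}

    def read_stamp(self):
        # Workers fetch this together with the model at the start of an iteration.
        return self.stamp

    def add(self, stamp_token, key, value):
        if stamp_token != self.stamp:
            return False                  # stale update from before the last layer
        self.values[key] = self.values.get(key, 0.0) + value
        return True

    def advance_stamp(self):
        # Called by the Chief after growing a layer; old statistics are flushed.
        self.stamp += 1
        self.values = {}


acc = StampedAccumulator()
token = acc.read_stamp()
print(acc.add(token, "bucket_3", 1.5))    # True: accepted
acc.advance_stamp()                       # Chief grows a new layer
print(acc.add(token, "bucket_3", 2.0))    # False: stale token, update ignored
```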
References
1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. In: OSDI (2016)
2. Chen, T., et al.: XGBoost: a scalable tree boosting system. CoRR (2016)
3. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001)
4. Jiang, J., Jiang, J., Cui, B., Zhang, C.: TencentBoost: a gradient boosting tree system with parameter server. In: 33rd IEEE International Conference on Data Engineering (ICDE 2017), pp. 281–284 (2017). https://doi.org/10.1109/ICDE.2017.87
5. Meng, X., et al.: MLlib: machine learning in Apache Spark. CoRR (2015). http://arxiv.org/abs/1505.06807
6. Microsoft: Microsoft/dmtk (2013). https://github.com/microsoft/dmtk
7. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
8. Ridgeway, G.: Generalized boosted models: a guide to the GBM package (2005)
9. Zhou, J., et al.: PSMART: parameter server based multiple additive regression trees system. In: WWW 2017 Companion (2017)