1 Introduction

Gradient boosted trees are popular machine learning models. Since their introduction in [3], they have gone on to dominate many competitions on real-world data, including Kaggle and KDDCup [2]. In addition to their excellent accuracy, they are also easy to use, as they deal well with unnormalized, collinear, missing, or outlier-infested data. They can support custom loss functions and are often easier to interpret than neural nets or large linear models. Because of their popularity, there are now many gradient boosted tree implementations, including scikit-learn [7], R gbm [8], Spark MLLib [5], LightGBM [6], and XGBoost [2].

In this paper, we introduce another optimized and scalable gradient boosted tree library, TF Boosted Trees (TFBT), which is built on top of the TensorFlow framework [1]. TFBT incorporates a number of novel algorithmic improvements to the gradient boosting algorithm, including a new per-layer boosting procedure that offers improved performance on some problems. TFBT is open source and available in the main TensorFlow distribution under contrib/boosted_trees.

2 TFBT Features

In Table 1 we provide a brief comparison between TFBT and some existing libraries. Additionally, TFBT provides the following.

Layer-by-Layer Boosting. TFBT supports two modes of tree building: standard (building a sequence of boosted trees in a stochastic gradient fashion) and a novel Layer-by-Layer boosting, which allows for stronger trees (leading to faster convergence) and deeper models. One weakness of tree-based methods is that only the examples falling into a given partition are used to produce the estimator associated with that leaf, so deeper nodes use statistics calculated from fewer examples. We overcome this limitation by recalculating the gradients and Hessians whenever a new layer is built, resulting in stronger trees that better approximate the functional space gradient. This enables deeper nodes to use higher-level splits as priors, meaning each new layer has more information and can better adjust for errors from the previous layers. Empirically, we found that layer-by-layer boosting generally leads to faster convergence and, with proper regularization, to less overfitting for deeper trees.
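
To make the mechanism concrete, below is a NumPy sketch of layer-by-layer growth under a squared-error loss. It is our own simplification with illustrative helper names (grad_hess, best_stump, leaf_value), not the TFBT implementation: the point is that gradients and Hessians are recomputed against the model including the partially built tree before each new layer, whereas standard boosting would compute them only once per tree.

    import numpy as np

    def grad_hess(preds, targets):
        # Squared-error loss: gradient is the residual, Hessian is 1.
        return preds - targets, np.ones_like(preds)

    def best_stump(x, g, h, l2=1.0):
        # Enumerate candidate thresholds; keep the one with the largest score.
        best = None
        for t in np.unique(x)[:-1]:
            left, right = x <= t, x > t
            score = (g[left].sum() ** 2 / (h[left].sum() + l2)
                     + g[right].sum() ** 2 / (h[right].sum() + l2))
            if best is None or score > best[0]:
                best = (score, t)
        return None if best is None else best[1]

    def leaf_value(g, h, l2=1.0):
        return -g.sum() / (h.sum() + l2)

    def grow_tree_layer_by_layer(x, y, preds, depth=3, lr=0.1):
        leaves = [np.arange(len(x))]        # index sets of the current leaves
        tree_output = np.zeros(len(preds))
        for _ in range(depth):
            # Key difference from standard boosting: gradients and Hessians are
            # recomputed against the model *including* the partially built tree.
            g, h = grad_hess(preds + lr * tree_output, y)
            next_leaves = []
            for idx in leaves:
                t = best_stump(x[idx], g[idx], h[idx])
                if t is None:               # no usable split; just refresh the leaf
                    tree_output[idx] = leaf_value(g[idx], h[idx])
                    next_leaves.append(idx)
                    continue
                left, right = idx[x[idx] <= t], idx[x[idx] > t]
                tree_output[left] = leaf_value(g[left], h[left])
                tree_output[right] = leaf_value(g[right], h[right])
                next_leaves += [left, right]
            leaves = next_leaves
        return preds + lr * tree_output     # updated ensemble predictions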

Multiclass Support. TFBT supports one-vs-rest, as well as two variations that reduce the number of required trees by storing per-class scores at each leaf. All other implementations use one-vs-rest (MLLib has no multiclass support).

Table 1. Comparison of gradient boosted libraries.

Since TFBT is implemented in TensorFlow, TensorFlow-specific features are also available:

  • Ease of writing custom loss functions, as TensorFlow provides automatic differentiation [1] (other packages, like XGBoost, require the user to provide the first- and second-order derivatives); see the sketch after this list.

  • Ability to easily switch and compare TFBT with other TensorFlow models.

  • Ease of debugging with TensorBoard.

  • Models can be run on multiple CPUs/GPUs and on multiple platforms, including mobile, and can be easily deployed via TF Serving.

  • Checkpointing for fault tolerance, incremental training & warm restart.
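
As an illustration of the first point, the following sketch obtains per-example first- and second-order derivatives of a custom loss automatically, so the user never derives them by hand. It assumes the TensorFlow 2.x tf.GradientTape API (the loss and variable names are ours), rather than the graph-mode autodiff used by the contrib code.

    import tensorflow as tf

    def custom_loss(logits, labels):
        # Any differentiable loss works; logistic loss is used here as an example.
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

    logits = tf.constant([0.3, -1.2, 2.0])
    labels = tf.constant([1.0, 0.0, 1.0])

    with tf.GradientTape() as outer:
        outer.watch(logits)
        with tf.GradientTape() as inner:
            inner.watch(logits)
            loss = custom_loss(logits, labels)
        grads = inner.gradient(loss, logits)      # per-example first derivatives
    hessians = outer.gradient(grads, logits)      # diagonal of the Hessian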

3 TFBT System Design

Finding Splits. One of the most computationally intensive parts of boosting is finding the best splits. Both R gbm and scikit-learn use an exact greedy algorithm that enumerates all possible splits for all features, which does not scale. Other implementations, like XGBoost, use approximate algorithms that build quantiles of the feature values and aggregate gradients and Hessians for each quantile bucket. For the aggregation, two approaches can be used [4]: either each worker works on all the features and the statistics are then aggregated in a Map-Reduce (MLLib) or All-Reduce (XGBoost) fashion, or a parameter server (PS) approach (TencentBoost [4], PSMART [9]) is applied, in which each worker and PS aggregates statistics only for a subset of features. The All-Reduce versions do not scale to high-dimensional data, and the Map-Reduce versions are slow to scale.
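
The core of the quantile-based approximation for a single feature can be sketched as follows (a NumPy illustration with a hypothetical function name, not any library's actual code): bucketize the feature by quantile boundaries, aggregate per-bucket gradients and Hessians, then scan the buckets for the split with the largest gain.

    import numpy as np

    def approximate_best_split(x, g, h, num_buckets=32, l2=1.0):
        # Quantile boundaries play the role of the distributed sketch.
        boundaries = np.quantile(x, np.linspace(0, 1, num_buckets + 1)[1:-1])
        bucket = np.searchsorted(boundaries, x)        # bucket id per example
        g_sum = np.bincount(bucket, weights=g, minlength=num_buckets)
        h_sum = np.bincount(bucket, weights=h, minlength=num_buckets)

        total_g, total_h = g_sum.sum(), h_sum.sum()
        parent_score = total_g ** 2 / (total_h + l2)
        best_gain, best_bucket, gl, hl = -np.inf, None, 0.0, 0.0
        for b in range(num_buckets - 1):               # split between b and b + 1
            gl, hl = gl + g_sum[b], hl + h_sum[b]
            gr, hr = total_g - gl, total_h - hl
            gain = gl ** 2 / (hl + l2) + gr ** 2 / (hr + l2) - parent_score
            if gain > best_gain:
                best_gain, best_bucket = gain, b
        return best_gain, best_bucket, boundaries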

TFBT Architecture. Our computation model is based on the following needs:

  1. Ability to train on datasets that don't fit in workers' memory.

  2. Ability to train deeper trees with a larger number of features.

  3. Support for different modes of building the trees: the standard one-tree-per-batch mode, as well as boosting the tree layer-by-layer.

  4. Minimizing parallelization costs. Low-cost restarts on stateless workers would allow us to use much cheaper preemptible VMs.

Fig. 1. TFBT architecture.

Our design is similar to XGBoost [2] and TencentBoost [4] in that we build distributed quantile sketches of feature values and use them to build histograms, which are later used to find the best split. In TencentBoost [4] and PSMART [9], the full training data is partitioned and loaded into workers' memory, which can be a problem for larger datasets. To address this, we instead work on mini-batches, updating the quantiles in an online fashion without loading all the data into memory. As far as we know, this approach is not implemented anywhere else.
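
A rough picture of the online-quantile idea is given below. This toy summary simply merges each mini-batch into a bounded, subsampled set of sorted values; TFBT uses a proper distributed quantile sketch, so treat this only as an illustration of the memory behavior.

    import numpy as np

    class StreamingQuantiles:
        """Toy streaming summary: merge each mini-batch, then subsample."""

        def __init__(self, max_size=1024):
            self.max_size = max_size
            self.summary = np.array([])        # sorted representative values

        def push_minibatch(self, values):
            merged = np.sort(np.concatenate([self.summary, values]))
            if len(merged) > self.max_size:
                # Keep evenly spaced representatives of the sorted merge so the
                # summary size stays bounded regardless of the dataset size.
                keep = np.linspace(0, len(merged) - 1, self.max_size).astype(int)
                merged = merged[keep]
            self.summary = merged

        def boundaries(self, num_buckets):
            # Approximate quantile boundaries used as histogram bucket borders.
            qs = np.linspace(0, 1, num_buckets + 1)[1:-1]
            return np.quantile(self.summary, qs)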

Each worker loads a mini-batch of data, builds a local quantile sketch, pushes it to the PS, and fetches the bucket boundaries that were built at the previous iteration (Fig. 1). Workers then compute per-bucket gradients and Hessians and push them back to the PS. One of the workers, designated as the Chief, checks during each iteration whether the PS have accumulated enough statistics for the current layer and, if so, starts building the new layer by finding the best splits for each of the nodes in the layer. The code that finds the best split for each feature is executed on the PS that accumulated the gradient statistics for that feature. The Chief receives the best split for every leaf from the PS and grows a new layer of the tree.
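
As a summary of this per-iteration protocol, the single-process stubs below model the fetch-boundaries, push-statistics, and grow-layer steps for a single feature. ParameterServer, worker_step, and chief_maybe_grow_layer are illustrative names of our own, not the actual TFBT components.

    import numpy as np

    class ParameterServer:
        # Holds bucket boundaries and accumulated statistics for one feature.
        def __init__(self, num_buckets):
            self.num_buckets = num_buckets
            self.boundaries = np.array([])     # built at the previous iteration
            self.g_sum = np.zeros(num_buckets)
            self.h_sum = np.zeros(num_buckets)
            self.num_minibatches = 0

        def push_stats(self, g_per_bucket, h_per_bucket):
            self.g_sum += g_per_bucket
            self.h_sum += h_per_bucket
            self.num_minibatches += 1

    def worker_step(ps, x, g, h):
        # A worker bucketizes its mini-batch using the fetched boundaries and
        # pushes per-bucket gradient/Hessian sums back to the PS.
        bucket = np.searchsorted(ps.boundaries, x)
        ps.push_stats(np.bincount(bucket, weights=g, minlength=ps.num_buckets),
                      np.bincount(bucket, weights=h, minlength=ps.num_buckets))

    def chief_maybe_grow_layer(ps, l2=1.0, min_minibatches=10):
        # The Chief grows a new layer only once enough statistics accumulated.
        if ps.num_minibatches < min_minibatches:
            return None
        cum_g, cum_h = np.cumsum(ps.g_sum)[:-1], np.cumsum(ps.h_sum)[:-1]
        total_g, total_h = ps.g_sum.sum(), ps.h_sum.sum()
        gain = (cum_g ** 2 / (cum_h + l2)
                + (total_g - cum_g) ** 2 / (total_h - cum_h + l2))
        return int(np.argmax(gain))            # index of the best split boundary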

Once the Chief adds a new layer, both the gradients and the quantiles become stale. To avoid stale updates, we introduce an abstraction called StampedResource, a TensorFlow resource with an int64 stamp. The tree ensemble, as well as the gradient and quantile accumulators, are all stamped resources carrying such a token. When a worker fetches the model, it gets the stamp token, which is then used for all reads and writes to stamped resources until the end of the iteration. This guarantees that all updates are consistent and that the Chief doesn't need to wait for Workers to synchronize, which is important when using preemptible VMs. The Chief checkpoints resources to disk, and workers don't hold any state, so if they are restarted, they can load a new mini-batch and continue.
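
The stamping logic can be pictured with the toy class below. It is conceptual only; the actual StampedResource is a TensorFlow resource, not this Python object.

    class StampedResource:
        """Conceptual stand-in: a value guarded by an int64 stamp token."""

        def __init__(self, value):
            self.value = value
            self.stamp = 0

        def read(self, stamp_token):
            if stamp_token != self.stamp:
                raise ValueError("stale stamp token")
            return self.value

        def try_write(self, stamp_token, new_value):
            if stamp_token != self.stamp:
                return False      # stale update from a slow or preempted worker
            self.value = new_value
            return True

        def advance_stamp(self):
            # The Chief bumps the stamp after growing a layer, which implicitly
            # invalidates statistics computed against the old model.
            self.stamp += 1
            return self.stamp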