Optimizations of distributed 'hist' mode for CPUs #4824
Conversation
Besides the optimizations, could you check #4716, which involves a correctness issue in the current syncing strategy?
@CodingCat Good point, we cannot leave the sync issue unaddressed before 1.0.0. I'll see if there's anyone in my org who can help, since it affects our work e.g. here. (I will also try to get some time myself.) Right now, #4716 has solved the correctness issue. However, I suspect that the current XGBoost code still performs sync per-node instead of per-depth, slowing down distributed training.
@CodingCat @hcho3 I have found an issue where distributed workers can build different trees, and I have fixed it. I also checked the log-loss for higgs (4 local workers): it is exactly the same across 2 runs. Is this enough for this PR?
Codecov Report
@@           Coverage Diff           @@
##           master    #4824   +/-   ##
=======================================
  Coverage   77.51%   77.51%
=======================================
  Files          11       11
  Lines        2055     2055
=======================================
  Hits         1593     1593
  Misses        462      462
=======================================

Continue to review the full report at Codecov.
@thvasilo It was done to reuse the this->histred_ rabit::Reducer object and avoid creating a new one specially for size_t. So we just compute the sum of all samples in each tree node using floats instead of size_t.
I'm afraid this would confuse potential future maintainers, so at the very least a comment should be added. Is it possible to use one of the existing aggregation ops (Sum) here, as in the rabit usage example? We shouldn't need a Reducer object to sum/max some integers, I think.
@thvasilo I added your variant of Reduce |
Thanks @SmirnovEgorRu, I think this better gets across the purpose of the variables now. |
CI contains the following error, which does not look like it is caused by my changes:
I can see it. It might be that Travis has an outdated SSL certificate; not sure yet.
Sorry for the long wait. In general, I prefer clean and well-structured code. For example, the Partition*Kernel: is it possible to just use std::partition? If not, I would at least add some comments for it. If it exists purely for performance, can we make a partition function that closely resembles the interface of std::partition? Another example: since we are creating tasks in quantile hist, are we using a task graph and async execution? If so, is it possible to define an explicit reusable structure for it, even if the implementation is not as good as other parallel libraries?
To me the amount of work for cleaning up quantile hist is just huge ...
// TODO(egorsmir): add parallel for
for (auto elem : nodes) {
  if (elem.sibling_nid > -1) {
    SubtractionTrick(hist_[elem.sibling_nid], hist_[elem.nid],
So SubtractionTrick can be deleted and the following becomes an implicit "subtraction trick"? It would be surprising if the subtraction trick were taking significant time. Maybe removing the pruner would help performance more; see #4874.
 common::GradStatHist::GradType* hist_data =
     reinterpret_cast<common::GradStatHist::GradType*>(hist_[nid].data());

-ReduceHistograms(hist_data, nullptr, nullptr, 0, hist_builder_.GetNumBins() * 2, i,
+ReduceHistograms(hist_data, nullptr, nullptr, cut_ptr[fid] * 2, cut_ptr[fid + 1] * 2, node,
Pass a Span instead of raw pointers.
@@ -102,3 +102,5 @@ List of Contributors
 * [Haoda Fu](https://github.com/fuhaoda)
 * [Evan Kepner](https://github.com/EvanKepner)
   - Evan Kepner added support for os.PathLike file paths in Python
 * [Egor Smirnov](https://github.com/SmirnovEgorRu)
You can open a separate PR for noting your contributions from previous PRs. I would prefer not to merge it as part of this one, but @hcho3 will make the decision.
@@ -589,7 +596,7 @@ void QuantileHistMaker::Builder::BuildHistsBatch(const std::vector<ExpandEntry>&

   // 3. Merge grad stats for each node
   // Sync histograms in case of distributed computation
-  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats);
+  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats, gmat);

   perf_monitor.UpdatePerfTimer(TreeGrowingPerfMonitor::timer_name::BUILD_HIST);
I replaced this perf_monitor with common::Monitor before, and somehow it's still here ...
Closed, since this code is no longer relevant. The optimization ideas can still be used in further commits. CC @ShvetsKS
I observed that distributed XGBoost on 1 node is slower than non-distributed (Batch) XGBoost. The difference is 1.5-3x depending on the data set and parameters. The reasons for this:
Performance measurements:
So, we see a 1.5x speedup over the previous version and a similar time to Batch mode.