Optimizations of distributed 'hist' mode for CPUs #4824
Conversation
Besides the optimizations, could you check #4716, which involves a correctness issue in the current syncing strategy?
@CodingCat Good point, we cannot leave the sync issue unaddressed before 1.0.0. I'll see if there's anyone in my org who can help, since it affects our work e.g. here. (I will also try to get some time myself.) Right now, #4716 has solved the correctness issue. However, I suspect that the current XGBoost code still performs sync per-node instead of per-depth, slowing down distributed training.
@CodingCat @hcho3 I have found an issue where distributed workers can build different trees, and I have fixed it. I also checked the log-loss for higgs (4 local workers): it is exactly the same across 2 runs. Is this enough for this PR?
Codecov Report
@@           Coverage Diff           @@
##           master    #4824   +/-   ##
=======================================
  Coverage   77.51%   77.51%
=======================================
  Files          11       11
  Lines        2055     2055
=======================================
  Hits         1593     1593
  Misses        462      462
=======================================

Continue to review the full report at Codecov.
@thvasilo It was done to reuse the this->histred_ rabit::Reducer object and avoid creating a new one specially for size_t. So we just compute the sum of all samples in each tree node using floats instead of size_t.
I'm afraid this would confuse potential future maintainers, so at the very least a comment should be added. Is it possible to use one of the existing aggregation ops (Sum) here, as in the rabit usage example? We shouldn't need a Reducer object to sum/max some integers, I think.
@thvasilo I added your variant of Reduce |
Thanks @SmirnovEgorRu, I think this better gets across the purpose of the variables now. |
CI contains the following error, which does not look like it is caused by my changes:
I can see it. It might be that Travis has an outdated SSL certificate; not sure yet.
Sorry for the long wait. In general, I prefer clean and well-structured code. For example, the Partition*Kernel: is it possible to just use std::partition? If not, I would at least add some comments for it. If it exists purely for performance, can we make a partition function that closely resembles the interface of std::partition? Another example: since we are creating tasks in quantile hist, are we using a task graph and async execution? If so, is it possible to define an explicit reusable structure for it, even if the implementation is not as good as other parallel libraries?
To me the amount of work for cleaning up quantile hist is just huge ...
// TODO(egorsmir): add parallel for
for (auto elem : nodes) {
  if (elem.sibling_nid > -1) {
    SubtractionTrick(hist_[elem.sibling_nid], hist_[elem.nid],
So SubtractionTrick can be deleted and the following becomes an implicit "subtraction trick"? It would be surprising if the subtraction trick were taking significant time. Maybe removing the pruner would help performance more; see #4874.
 common::GradStatHist::GradType* hist_data =
     reinterpret_cast<common::GradStatHist::GradType*>(hist_[nid].data());

-ReduceHistograms(hist_data, nullptr, nullptr, 0, hist_builder_.GetNumBins() * 2, i,
+ReduceHistograms(hist_data, nullptr, nullptr, cut_ptr[fid] * 2, cut_ptr[fid + 1] * 2, node,
Pass a Span instead of raw pointers.
@@ -102,3 +102,5 @@ List of Contributors
 * [Haoda Fu](https://github.com/fuhaoda)
 * [Evan Kepner](https://github.com/EvanKepner)
   - Evan Kepner added support for os.PathLike file paths in Python
 * [Egor Smirnov](https://github.com/SmirnovEgorRu)
You can open a separate PR for noting your contributions from previous PRs. I would prefer not to merge it as part of this one, but @hcho3 will make the decision.
@@ -589,7 +596,7 @@ void QuantileHistMaker::Builder::BuildHistsBatch(const std::vector<ExpandEntry>&

   // 3. Merge grad stats for each node
   // Sync histograms in case of distributed computation
-  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats);
+  SyncHistograms(p_tree, nodes, hist_buffers, hist_is_init, grad_stats, gmat);

   perf_monitor.UpdatePerfTimer(TreeGrowingPerfMonitor::timer_name::BUILD_HIST);
I replaced this perf_monitor with common::Monitor before, and somehow it's still here ...
Closed, since this code is no longer relevant. The optimization ideas can still be used in further commits. CC @ShvetsKS
I observed that distributed XGBoost on 1 node is slower than non-distributed (Batch) XGBoost. The difference is 1.5-3x depending on the data set and parameters. The reasons for this:
Performance measurements:
So, we see a 1.5x speedup over the previous version and a similar time to Batch mode.