[REVIEW] Allow saving Dask RandomForest models immediately after training (fixes #3331) #3388
@jameslamb I'm sorry to hear that you had trouble building cuML from source. Here is a page for a Dockerized setup. Also, feel free to ping me if you'd like more help with building cuML (I do it at least several times a week).
rerun tests
Codecov Report
@@             Coverage Diff              @@
##           branch-0.18    #3388   +/-   ##
===============================================
+ Coverage        71.48%   71.60%   +0.12%
===============================================
  Files              207      210       +3
  Lines            16748    16932     +184
===============================================
+ Hits             11973    12125     +152
- Misses            4775     4807      +32
Looks good, and thank you for the contribution! I have two small requests.
Also, one test is failing but that is for an unrelated issue currently being fixed in #3391
yet been trained.
"""

# set internal model if it hasn't been accessed before
Can we move these to dask.ensemble.base instead of the two versions?
Oh sure, I can try that. I didn't think that would work because cuml.dask.ensemble.base.BaseRandomForestModel inherits from object and doesn't have self._get_internal_model() defined.
But I see now that it references that method (cuml/python/cuml/dask/ensemble/base.py, line 172 in 816bb65: if self._get_internal_model() is None:)
cuml.dask.common.base.BaseEstimator
moved in 83f10b7
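For readers following the inheritance question above, here is a minimal sketch of the pattern being discussed: a shared base that inherits from object and calls a method it does not define, relying on the concrete class to also inherit from an estimator base that supplies it. The class bodies and the _build_combined() helper are hypothetical stand-ins written for illustration, not the cuML code.

```python
class BaseEstimator:
    """Hypothetical stand-in for the shared Dask estimator base."""

    def __init__(self):
        self._internal_model = None

    def _get_internal_model(self):
        return self._internal_model


class BaseRandomForestModel(object):
    """Mixin-style base: uses _get_internal_model() without defining it."""

    def get_combined_model(self):
        # Build the combined model only if it has not been set yet.
        if self._get_internal_model() is None:
            self._internal_model = self._build_combined()
        return self._get_internal_model()


class RandomForestClassifier(BaseRandomForestModel, BaseEstimator):
    def _build_combined(self):
        # Hypothetical helper; a real model would merge per-worker trees.
        return "combined-classifier"


class RandomForestRegressor(BaseRandomForestModel, BaseEstimator):
    def _build_combined(self):
        return "combined-regressor"


print(RandomForestClassifier().get_combined_model())
print(RandomForestRegressor().get_combined_model())
```

Because both concrete classes list the mixin first and the estimator base second, a single get_combined_model() definition can serve the classifier and the regressor. Instantiating the mixin alone would fail with an AttributeError, which is the concern raised in the comment above.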
)
X = X.astype(np.float32)
if estimator_type == 'classification':
    cu_rf_mg = cuRFC_mg(
Many of the params here are either defaults or close to defaults. You could omit them to shrink the test and make it easier to maintain.
sure, no problem. I copied this exactly from the existing dask random forest tests:
cuml/python/cuml/test/dask/test_random_forest.py
Lines 366 to 378 in 816bb65
    cu_rf_mg = cuRFC_mg(max_features=1.0, max_samples=1.0,
                        n_bins=16, split_algo=0, split_criterion=0,
                        min_samples_leaf=2, seed=23707, n_streams=1,
                        n_estimators=n_estimators, max_leaves=-1,
                        max_depth=max_depth)
    y = y.astype(np.int32)
elif estimator_type == 'regression':
    cu_rf_mg = cuRFR_mg(max_features=1.0, max_samples=1.0,
                        n_bins=16, split_algo=0,
                        min_samples_leaf=2, seed=23707, n_streams=1,
                        n_estimators=n_estimators, max_leaves=-1,
                        max_depth=max_depth)
    y = y.astype(np.float32)
updated in 83f10b7. I tried to keep a few that would have the most impact on the runtime of training, to keep the tests quick
added a fix in 6a1f2d2, sorry. I got confused by the different levels of inheritance
rerun tests
Looks good, @jameslamb ! Sorry about the delay, I wanted to also try this out locally. Pickles smoothly now.
@gpucibot merge
No problem! Thanks for the reviews, and to @hcho3 for the pointer to a dockerized setup for building cuml from source. I'll definitely make use of that in the future.
This attempts to fix #3331. See that issue for a lot more details.
Today, .get_combined_model() for the Dask RandomForest model objects returns None if it's called immediately after training. That pattern is recommended in "Distributed Model Pickling". Without this support, there is no way to save a Dask RandomForest model using only public methods / attributes on those classes.

Per #3331 (comment), this PR proposes populating the internal model object whenever get_combined_model() is called.

Notes for Reviewers

I tried to build cuml from source following https://github.com/rapidsai/cuml/blob/main/BUILD.md, and was not successful. If there is a containerized setup for developing cuml, I'd greatly appreciate it and would be happy to try it out. I've added a unit test for this change, so I hope that will be enough to confirm that this works and that CI will catch any mistakes I've made.

Thanks for your time and consideration.
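
The behavior proposed in this description can be illustrated with a small, self-contained sketch. It does not use cuml or Dask: the DaskForest class, its attributes, and the "tree-N" strings are hypothetical stand-ins, and SimpleNamespace plays the role of the combined single-GPU model. The only point is the pattern of populating the internal model on the first call to get_combined_model() so the result can be pickled right after training.

```python
import pickle
from types import SimpleNamespace


class DaskForest:
    """Hypothetical stand-in for a distributed random forest wrapper."""

    def __init__(self):
        self._worker_trees = []      # per-worker results, filled in by fit()
        self._internal_model = None  # combined model, built lazily

    def fit(self, partitions):
        # Pretend each partition trains one "tree" on a worker.
        self._worker_trees = [f"tree-{i}" for i, _ in enumerate(partitions)]
        # Any previously combined model is stale after retraining.
        self._internal_model = None
        return self

    def get_combined_model(self):
        # Before training there is nothing to combine.
        if not self._worker_trees:
            return None
        # Populate the internal model on first access instead of returning None.
        if self._internal_model is None:
            self._internal_model = SimpleNamespace(trees=list(self._worker_trees))
        return self._internal_model


model = DaskForest().fit([[1, 2], [3, 4], [5, 6]])
# Save and restore the combined model immediately after training.
restored = pickle.loads(pickle.dumps(model.get_combined_model()))
print(restored.trees)
```

In this sketch an untrained model still returns None, matching the docstring in the diff, while a trained model always returns a picklable object.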