-
Notifications
You must be signed in to change notification settings - Fork 23k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DDP] Allow tuning of first bucket #62748
Conversation
Previously after buckets were rebuilt the first bucket size was always defaulted to 1MB, this diff allows first bucket to be tuned like the rest of the bucket sizes can. Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as expected: I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size limits: 1, 1048, 1048 bytes. Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/) [ghstack-poisoned]
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit d4e33a0 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages: pytorch_linux_xenial_py3_6_gcc5_4_build (1/1)Step: "Build" (full log | diagnosis details | 🔁 rerun)
|
Previously after buckets were rebuilt the first bucket size was always defaulted to 1MB, this diff allows first bucket to be tuned like the rest of the bucket sizes can. Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as expected: I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size limits: 1, 1048, 1048 bytes. Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/) ghstack-source-id: 135063720 Pull Request resolved: #62748
Previously after buckets were rebuilt the first bucket size was always defaulted to 1MB, this diff allows first bucket to be tuned like the rest of the bucket sizes can. Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as expected: I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size limits: 1, 1048, 1048 bytes. Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/) [ghstack-poisoned]
@@ -630,6 +630,9 @@ def _ddp_init_helper( | |||
self.find_unused_parameters, | |||
self.gradient_as_bucket_view, | |||
param_to_name_mapping, | |||
# User can set dist._DEFAULT_FIRST_BUCKET_BYTES to tune DDP first | |||
# bucket. | |||
dist._DEFAULT_FIRST_BUCKET_BYTES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since we are passing python config to c++, we may be able to kill this 'module.attr("_DEFAULT_FIRST_BUCKET_BYTES") = ::c10d::kDefaultFirstBucketBytes;' that is used for sharing config btw python and C++, and only define the constant in Python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, filed #62790 to track it
CI failure is unrelated:
|
This pull request has been merged in 80091cb. |
Stack from ghstack:
Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.
Setting
dist._DEFAULT_FIRST_BUCKET_BYTES = 1
results in the following logs asexpected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.
Differential Revision: D30110041