Nothing Special   »   [go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DDP] Allow tuning of first bucket #62748

Closed
wants to merge 2 commits into from

Conversation

rohan-varma
Copy link
Member
@rohan-varma rohan-varma commented Aug 4, 2021

Stack from ghstack:

Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.

Setting dist._DEFAULT_FIRST_BUCKET_BYTES = 1 results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.

Differential Revision: D30110041

Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.

Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor
facebook-github-bot commented Aug 4, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit d4e33a0 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Aug 04 21:23:29 collect2: error: ld returned 1 exit status
Aug 04 21:23:26 [ 94%] Building CXX object caffe2/CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/deep_wide_pt.cc.o
Aug 04 21:23:27 [ 94%] Built target int8_roi_align_op_test
Aug 04 21:23:27 [ 94%] Building CXX object caffe2/CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_utils.cc.o
Aug 04 21:23:27 [ 94%] Building CXX object caffe2/CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_static_runtime.cc.o
Aug 04 21:23:27 Scanning dependencies of target init_test
Aug 04 21:23:28 [ 94%] Building CXX object caffe2/CMakeFiles/init_test.dir/core/init_test.cc.o
Aug 04 21:23:28 [ 94%] Linking CXX executable ../bin/static_runtime_test
Aug 04 21:23:28 [ 94%] Linking CXX executable ../bin/init_test
Aug 04 21:23:29 CMakeFiles/static_runtime_test.dir/__/benchmarks/static_runtime/test_utils.cc.o: In function `torch::jit::test::testStaticRuntime(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, std::vector<c10::IValue, std::allocator<c10::IValue> > const&, bool, bool, bool)':
Aug 04 21:23:29 test_utils.cc:(.text+0x2f2f): undefined reference to `torch::jit::disableUnsafeMathOp(char const*)'
Aug 04 21:23:29 collect2: error: ld returned 1 exit status
Aug 04 21:23:29 make[2]: *** [bin/static_runtime_test] Error 1
Aug 04 21:23:29 caffe2/CMakeFiles/static_runtime_test.dir/build.make:157: recipe for target 'bin/static_runtime_test' failed
Aug 04 21:23:29 CMakeFiles/Makefile2:10153: recipe for target 'caffe2/CMakeFiles/static_runtime_test.dir/all' failed
Aug 04 21:23:29 make[1]: *** [caffe2/CMakeFiles/static_runtime_test.dir/all] Error 2
Aug 04 21:23:29 make[1]: *** Waiting for unfinished jobs....
Aug 04 21:23:29 [ 94%] Built target init_test
Aug 04 21:23:31 [ 94%] Built target caffe2_pybind11_state
Aug 04 21:23:31 make: *** [all] Error 2
Aug 04 21:23:31 Makefile:138: recipe for target 'all' failed
Aug 04 21:23:31 -- Building version 1.10.0a0+git7e1db18

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

rohan-varma added a commit that referenced this pull request Aug 4, 2021
Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.

Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/)

ghstack-source-id: 135063720
Pull Request resolved: #62748
Previously after buckets were rebuilt the first bucket size was always
defaulted to 1MB, this diff allows first bucket to be tuned like the rest of
the bucket sizes can.

Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following logs as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size
limits: 1, 1048, 1048 bytes.

Differential Revision: [D30110041](https://our.internmc.facebook.com/intern/diff/D30110041/)

[ghstack-poisoned]
@@ -630,6 +630,9 @@ def _ddp_init_helper(
self.find_unused_parameters,
self.gradient_as_bucket_view,
param_to_name_mapping,
# User can set dist._DEFAULT_FIRST_BUCKET_BYTES to tune DDP first
# bucket.
dist._DEFAULT_FIRST_BUCKET_BYTES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since we are passing python config to c++, we may be able to kill this 'module.attr("_DEFAULT_FIRST_BUCKET_BYTES") = ::c10d::kDefaultFirstBucketBytes;' that is used for sharing config btw python and C++, and only define the constant in Python

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, filed #62790 to track it

@rohan-varma
Copy link
Member Author

CI failure is unrelated:

Aug 04 21:23:29 test_utils.cc:(.text+0x2f2f): undefined reference to `torch::jit::disableUnsafeMathOp(char const*)'

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 80091cb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants