[c10d] Use pg wrapper in detailed debug mode #58281

rohan-varma · 2021-05-14T00:13:29Z

Stack from ghstack:

* #58281 [c10d] Use pg wrapper in detailed debug mode

When TORCH_DISTRIBUTED_DEBUG=DETAIL is set, this PR wraps nccl and gloo process groups created by `new_group` and `init_process_group` in a `ProcessGroupWrapper`.

As a result, the user gets back a `ProcessGroupWrapper` that can be used exactly like a regular nccl/gloo pg, but that is more helpful when debugging desyncs and hangs.
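A minimal sketch of the wrapping decision described above, in plain Python so it runs without a distributed setup. `FakeProcessGroup`, `FakeWrapper`, and `maybe_wrap` are hypothetical stand-ins, not the actual c10d classes or the code in this PR, and treating the env var value case-insensitively is an assumption.

```python
import os


class FakeProcessGroup:
    """Hypothetical stand-in for a gloo or nccl process group."""

    def __init__(self, backend):
        self.backend = backend

    def allreduce(self, values):
        # Single-process placeholder for a real collective.
        return sum(values)


class FakeWrapper:
    """Hypothetical stand-in for ProcessGroupWrapper: same interface, extra checks."""

    def __init__(self, pg):
        self.wrapped = pg
        self.backend = pg.backend

    def allreduce(self, values):
        # A real wrapper would run its cross-rank consistency check here
        # before delegating to the wrapped process group.
        return self.wrapped.allreduce(values)


def maybe_wrap(pg, env=None):
    """Wrap gloo/nccl groups only when detailed debug mode is requested."""
    env = os.environ if env is None else env
    # Assumption: the value is compared case-insensitively.
    level = env.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    if level == "DETAIL" and pg.backend in ("gloo", "nccl"):
        return FakeWrapper(pg)
    return pg
```

Because the wrapper exposes the same methods as the group it wraps, calling code does not need to know which of the two it received.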

Apart from the collective desync checks, which should be transparent when the user application has no issues, using the wrapper pg introduces no semantic differences. The checks do add performance overhead, but that is a tradeoff we accept when DETAIL debug mode is enabled.
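To illustrate what a collective desync check could look like, here is a rough single-process sketch. This PR does not spell out how the wrapper implements the check, so `Fingerprint` and `check_in_sync` are hypothetical names, and the assumption is that each rank describes the collective it is about to run and the descriptions are compared.

```python
from typing import NamedTuple, Sequence, Tuple


class Fingerprint(NamedTuple):
    """Hypothetical description of the collective a rank is about to run."""
    op: str
    shapes: Tuple[Tuple[int, ...], ...]


def check_in_sync(fingerprints: Sequence[Fingerprint]) -> None:
    """fingerprints[i] is what rank i is about to run; raise on any mismatch."""
    reference = fingerprints[0]
    for rank, fp in enumerate(fingerprints):
        if fp != reference:
            raise RuntimeError(
                f"rank {rank} is running {fp}, but rank 0 is running {reference}"
            )
```

When every rank issues the same collective with the same shapes the check passes silently, which is the "transparent" case; the extra comparison before each collective is where the performance cost would come from.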

Open to suggestions on how to test this better. So far I have verified locally that setting TORCH_DISTRIBUTED_DEBUG=DETAIL creates the wrapper and that all tests still pass, but that configuration doesn't run in CI. On the other hand, running every test both with debug=DETAIL and in the regular mode might be too much, so we have only enabled it for a few tests for now. There are also tests in the diff below.
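One possible way to turn the mode on for only a handful of tests is to patch the environment per test with the standard library's `unittest.mock.patch.dict`. This is a sketch, not how the tests in this PR do it; `DetailDebugTest` and `run_suite` are hypothetical names, and it assumes the debug level is read from the environment after the patch takes effect (for multi-process tests the variable would have to be set before the workers start).

```python
import os
import unittest
from unittest import mock


class DetailDebugTest(unittest.TestCase):
    # The env var is set only for the duration of this one test.
    @mock.patch.dict(os.environ, {"TORCH_DISTRIBUTED_DEBUG": "DETAIL"})
    def test_runs_with_detail(self):
        # A real test would create a process group here and check its type.
        self.assertEqual(os.environ["TORCH_DISTRIBUTED_DEBUG"], "DETAIL")


def run_suite():
    """Run the hypothetical test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DetailDebugTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```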

Differential Revision: D28402301

facebook-github-bot · 2021-05-14T00:13:34Z

💊 CI failures summary and remediations

As of commit c8d218c (more details on the Dr. CI page):

2/2 failures possibly* introduced in this PR
- 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests"

May 25 10:19:13 The PR is introducing backward ...m to confirm whether this change is wanted or not.

May 25 10:19:13 processing existing schema:  alltoall(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, Tensor[] _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
May 25 10:19:13 processing existing schema:  send(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
May 25 10:19:13 processing existing schema:  recv(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
May 25 10:19:13 processing existing schema:  recv_anysource(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
May 25 10:19:13 processing existing schema:  barrier(__torch__.torch.classes.dist_c10d.ProcessGroup _0) -> (__torch__.torch.classes.dist_c10d.Work _0)
May 25 10:19:13 processing existing schema:  __init__(__torch__.torch.classes.dist_c10d.frontend _0) -> (NoneType _0)
May 25 10:19:13 processing existing schema:  new_process_group_helper(__torch__.torch.classes.dist_c10d.frontend _0, int _1, int _2, int[] _3, str _4, __torch__.torch.classes.dist_c10d.Store _5, str? _6, int _7) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
May 25 10:19:13 processing existing schema:  get_process_group_by_name(__torch__.torch.classes.dist_c10d.frontend _0, str _1) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
May 25 10:19:13 processing existing schema:  get_name_of_process_group(__torch__.torch.classes.dist_c10d.frontend _0, __torch__.torch.classes.dist_c10d.ProcessGroup _1) -> (str _0)
May 25 10:19:13 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (NoneType _0)
May 25 10:19:13 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
May 25 10:19:13 
May 25 10:19:13 Broken ops: [
May 25 10:19:13 	aten::repeat_interleave.Tensor(Tensor repeats, int? output_size=None) -> (Tensor)
May 25 10:19:13 	aten::repeat_interleave.self_Tensor(Tensor self, Tensor repeats, int? dim=None, int? output_size=None) -> (Tensor)
May 25 10:19:13 	aten::repeat_interleave.self_int(Tensor self, int repeats, int? dim=None, int? output_size=None) -> (Tensor)
May 25 10:19:13 ]
May 25 10:19:13 =================== sccache compilation log ===================
May 25 10:19:13 =========== If your build fails, please take a look at the log above for possible reasons ===========
May 25 10:19:13 Compile requests                      0
May 25 10:19:13 Compile requests executed             0

This comment was automatically generated by Dr. CI (expand for details).

zhaojuanmao · 2021-05-17T19:25:01Z

torch/testing/_internal/distributed/distributed_test.py

+            # thus accesses std::map, which fills in a default value for the
+            # type if it didn't exist.
+            self.assertTrue(
+                ddp_logging_data.get("comm_hook") is None or


nit: `self.assertEqual(ddp_logging_data.get("comm_hook", ""), "")`
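For illustration, the intent of the check can be sketched as a small helper that treats a missing key and an empty-string default the same way. `comm_hook_unset` is a hypothetical name, and a plain dict stands in for the DDP logging data.

```python
def comm_hook_unset(ddp_logging_data):
    """True when no comm hook was logged.

    The key may be absent, or present with the empty string that a
    std::map lookup fills in as the default value.
    """
    return ddp_logging_data.get("comm_hook", "") == ""
```

Note that `assertTrue(x, "")` would treat the second argument as the failure message and fail whenever `x` is an empty string, so an equality assertion is what expresses this check.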

zhaojuanmao · 2021-05-17T19:25:27Z

torch/testing/_internal/distributed/distributed_test.py

+            # Note: DETAIL debug mode logs DDP logging data to stdout and
+            # thus accesses std::map, which fills in a default value for the
+            # type if it didn't exist.
+            self.assertTrue(


same as above

facebook-github-bot · 2021-05-25T16:57:13Z

This pull request has been merged in 19bcbfc.

Summary: Pull Request resolved: pytorch#58281

ghstack-source-id: 129817857

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28402301

fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede

rohan-varma requested review from H-Huang, mingzhe09088, mrshenli, pritamdamania87, wayi1 and zhaojuanmao as code owners May 14, 2021 00:13

facebook-github-bot added the `oncall: distributed` and `cla signed` labels May 14, 2021

rohan-varma mentioned this pull request May 14, 2021

[c10d] Introduce ProcessGroupWrapper #58224

Closed

zhaojuanmao approved these changes May 17, 2021

rohan-varma mentioned this pull request May 20, 2021

[c10d] Fix monitored_barrier with wait_all_ranks #58702

Closed

facebook-github-bot closed this in 19bcbfc May 25, 2021

facebook-github-bot added the Merged label May 25, 2021

facebook-github-bot deleted the gh/rohan-varma/315/head branch May 29, 2021 14:17
