[DDP] Log when errors happen #59281

Closed
rohan-varma wants to merge 4 commits

rohan-varma (Member) commented Jun 2, 2021

Stack from ghstack:

Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration, in which case the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in Python-side DDP will be added in the next diff.
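
To illustrate the intended contract, here is a minimal pure-Python sketch. It is not the reducer code from this PR and does not use any PyTorch API. Only the field names "has_error" and "error" come from the description above; `IterationLog`, `run_logged`, `usable_stats`, and the `backward_time_ms` stat are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IterationLog:
    """Hypothetical per-iteration record; only has_error/error come from the PR."""
    has_error: bool = False
    error: str = ""
    # Stand-in for the performance stats, which may be stale after an error.
    perf_stats: Dict[str, float] = field(default_factory=dict)


def run_logged(step: Callable[[], Dict[str, float]], log: IterationLog) -> None:
    """Run one iteration, recording either fresh stats or the error."""
    try:
        stats = step()
    except Exception as exc:
        # Mark the record as failed. perf_stats is deliberately left untouched,
        # so it may still hold values from an earlier iteration.
        log.has_error = True
        log.error = f"{type(exc).__name__}: {exc}"
        raise
    log.has_error = False
    log.error = ""
    log.perf_stats = stats


def usable_stats(log: IterationLog) -> Dict[str, float]:
    """Return stats only when the iteration completed without error."""
    return {} if log.has_error else dict(log.perf_stats)
```

A consumer of such a record would check `has_error` first and ignore the performance stats whenever it is set, since they may describe an earlier iteration.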

Differential Revision: D28652717

facebook-github-bot (Contributor) commented Jun 2, 2021

💊 CI failures summary and remediations

As of commit 172aa56 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests"

Jun 02 22:44:36 [ FAILED ] AtenXlaTensorTest.TestBitwiseAndPromotion
Jun 02 22:44:36 [----------] 1 test from XlaUtilCacheTest (0 ms total)
Jun 02 22:44:36 
Jun 02 22:44:36 [----------] Global test environment tear-down
Jun 02 22:44:36 [==========] 592 tests from 8 test suites ran. (504694 ms total)
Jun 02 22:44:36 [  PASSED  ] 588 tests.
Jun 02 22:44:36 [  SKIPPED ] 1 test, listed below:
Jun 02 22:44:36 [  SKIPPED ] AtenXlaTensorTest.TestGroupNormBackward
Jun 02 22:44:36 [  FAILED  ] 3 tests, listed below:
Jun 02 22:44:36 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAnd
Jun 02 22:44:36 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAndScalar
Jun 02 22:44:36 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAndPromotion
Jun 02 22:44:36 
Jun 02 22:44:36  3 FAILED TESTS
Jun 02 22:44:36 + cleanup
Jun 02 22:44:36 + retcode=1
Jun 02 22:44:36 + set +x
Jun 02 22:44:36 =================== sccache compilation log ===================
Jun 02 22:44:36 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 02 22:44:36 Compile requests                      0
Jun 02 22:44:36 Compile requests executed             0
Jun 02 22:44:36 Cache hits                            0

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with a title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI.

facebook-github-bot added the "oncall: distributed" and "cla signed" labels Jun 2, 2021
rohan-varma added a commit that referenced this pull request Jun 2, 2021
ghstack-source-id: 130330544
Pull Request resolved: #59281
rohan-varma (Member, Author) commented:

Looking into the test failures.

facebook-github-bot (Contributor) commented:

This pull request has been merged in 79aeca0.

facebook-github-bot deleted the gh/rohan-varma/320/head branch June 6, 2021 14:16
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Pull Request resolved: pytorch#59281

Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration, in which case the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in Python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
Labels
cla signed, Merged, oncall: distributed
3 participants