Be fix shard inbalance #60206
Conversation
💊 CI failures summary and remediations
As of commit ddb4511 (more details on the Dr. CI page and at hud.pytorch.org/pr/60206):
🕵️ 6 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
pytorch_linux_bionic_py3_6_clang9_noarch_test (1/6), step "Run tests"
self.name == 'cpp':  # The caffe2 cpp tests spawn duplicate test cases as well.
time_difference = self.test_suites[suite_name].replace(test_case)
self.total_time += time_difference
if is_multi_test:
Here we already know the tests are run exactly twice, so I think we are fine when the test file is in the list above.
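For context, a minimal sketch of the aggregation being discussed, assuming a test case record with name and time fields; the class and method names here are illustrative, not the exact ones in this PR:

from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    time: float

class TestSuite:
    def __init__(self, name):
        self.name = name
        self.test_cases = {}  # test case name -> TestCase
        self.total_time = 0.0

    def append(self, test_case):
        # When the same test case appears twice (e.g. the caffe2 cpp
        # tests spawn duplicates), sum the times rather than replacing
        # the entry, since the duplicate runs execute back to back.
        if test_case.name in self.test_cases:
            self.test_cases[test_case.name].time += test_case.time
        else:
            self.test_cases[test_case.name] = test_case
        self.total_time += test_case.time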
Force-pushed from 7c4b1a6 to 8cc1922.
@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
self.total_time = self.total_time + test_case.time - old_time
self.test_cases[name] = test_case
return test_case.time - old_time
self.test_cases[name].time += test_case.time
So is the design decision to take the sum of the times instead of the max? What about tests that were run in parallel? (This was the preliminary reason why we did max instead of sum.)
Do you have an example of a test that runs in parallel?
From what I can see, all those distributed tests are run sequentially; see:
run_test.py, lines 716 to 721 in 59b1003
    # Each run_test() call blocks until the test finishes, so the
    # distributed configurations execute sequentially, not in parallel.
    return_code = run_test(test_module, test_directory, options,
                           launcher_cmd=mpiexec)
else:
    return_code = run_test(test_module, test_directory, options)
if return_code != 0:
    return return_code
Ah, I think I previously misunderstood how the tests were spawned. Since test_distributed_spawn seems to just be spawned sequentially in run_test.py multiple times, summing is probably better. The downside is that there are other configurations where we might be better off taking the max (like test_cpp_extensions), but those seem like smaller tests in general.
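To make the trade-off concrete, a small sketch (a hypothetical helper, not code from this PR): summing models duplicate runs that execute back to back, while taking the max models runs that overlap in time.

def combine_times(times, parallel=False):
    # Sequential duplicates each add their full wall time; concurrent
    # duplicates overlap, so the longest run bounds the wall time.
    return max(times) if parallel else sum(times)

combine_times([30.0, 30.0])                 # 60.0: two sequential spawns
combine_times([30.0, 30.0], parallel=True)  # 30.0: two concurrent runs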
nice
@walterddr merged this pull request in c0f8cad.
First step to address #60136