Be fix shard inbalance #60206

walterddr · 2021-06-17T20:27:06Z

First step to address #60136

facebook-github-bot · 2021-06-17T20:27:13Z

💊 CI failures summary and remediations

As of commit ddb4511 (more details on the Dr. CI page and at hud.pytorch.org/pr/60206):

6/7 failures introduced in this PR
1/7 broken upstream at merge base 0cbb5e1 on Jun 17 from 1:00pm to 4:55pm

🕵️ 6 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_bionic_py3_6_clang9_noarch_test (1/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 17 22:57:43 RuntimeError: test_linalg failed! Received signal: SIGIOT

Jun 17 22:57:43   test_einsum_corner_cases_cpu (__main__.TestLinalgCPU) ... ok (0.008s)
Jun 17 22:57:43   test_einsum_cpu_complex128 (__main__.TestLinalgCPU) ... ok (0.016s)
Jun 17 22:57:43   test_einsum_cpu_float64 (__main__.TestLinalgCPU) ... ok (0.012s)
Jun 17 22:57:43   test_einsum_error_cases_cpu (__main__.TestLinalgCPU) ... ok (0.048s)
Jun 17 22:57:43   test_einsum_random_cpu_complex128 (__main__.TestLinalgCPU) ... free(): invalid pointer
Jun 17 22:57:43 Traceback (most recent call last):
Jun 17 22:57:43   File "test/run_test.py", line 1313, in <module>
Jun 17 22:57:43     main()
Jun 17 22:57:43   File "test/run_test.py", line 1292, in main
Jun 17 22:57:43     raise RuntimeError(err_message)
Jun 17 22:57:43 RuntimeError: test_linalg failed! Received signal: SIGIOT
Jun 17 22:57:44 
Jun 17 22:57:44 real	27m35.073s
Jun 17 22:57:44 user	36m54.353s
Jun 17 22:57:44 sys	6m43.485s
Jun 17 22:57:44 + cleanup
Jun 17 22:57:44 + retcode=1
Jun 17 22:57:44 + set +x
Jun 17 22:57:44 =================== sccache compilation log ===================
Jun 17 22:57:44 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 17 22:57:44 Compile requests                      87

pytorch_macos_10_13_py3_test (2/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jun 17 22:53:17 RuntimeError: test_linalg failed!

Jun 17 22:53:16 
Jun 17 22:53:16 FAILED (errors=2, skipped=51)
Jun 17 22:53:16 
Jun 17 22:53:16 Generating XML reports...
Jun 17 22:53:16 Generated XML report: test-reports/dist-gloo/test_linalg/TEST-TestLinalgCPU-20210617224850.xml
Jun 17 22:53:17 Traceback (most recent call last):
Jun 17 22:53:17   File "test/run_test.py", line 1313, in <module>
Jun 17 22:53:17     main()
Jun 17 22:53:17   File "test/run_test.py", line 1292, in main
Jun 17 22:53:17     raise RuntimeError(err_message)
Jun 17 22:53:17 RuntimeError: test_linalg failed!
Jun 17 22:53:17 + cleanup
Jun 17 22:53:17 + retcode=1
Jun 17 22:53:17 + set +x


Exited with code exit status 1

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (3/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 17 23:46:22 RuntimeError: test_linalg failed! Received signal: SIGIOT

Jun 17 23:46:21 7f5b9305a000-7f5b9305b000 rw-p 00000000 00:00 0 
Jun 17 23:46:21 7ffe7b3c1000-7ffe7b3e7000 rw-p 00000000 00:00 0                          [stack]
Jun 17 23:46:21 7ffe7b3f5000-7ffe7b3f8000 r--p 00000000 00:00 0                          [vvar]
Jun 17 23:46:21 7ffe7b3f8000-7ffe7b3f9000 r-xp 00000000 00:00 0                          [vdso]
Jun 17 23:46:21 ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
Jun 17 23:46:22 Traceback (most recent call last):
Jun 17 23:46:22   File "test/run_test.py", line 1313, in <module>
Jun 17 23:46:22     main()
Jun 17 23:46:22   File "test/run_test.py", line 1292, in main
Jun 17 23:46:22     raise RuntimeError(err_message)
Jun 17 23:46:22 RuntimeError: test_linalg failed! Received signal: SIGIOT
Jun 17 23:46:23 + cleanup
Jun 17 23:46:23 + retcode=1
Jun 17 23:46:23 + set +x
Jun 17 23:46:23 =================== sccache compilation log ===================
Jun 17 23:46:23 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 17 23:46:23 Compile requests                      0
Jun 17 23:46:23 Compile requests executed             0
Jun 17 23:46:23 Cache hits                            0
Jun 17 23:46:23 Cache misses                          0
Jun 17 23:46:23 Cache timeouts                        0

pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test1 (4/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 17 23:20:51 RuntimeError: test_linalg failed! Received signal: SIGIOT

Jun 17 23:20:50   test_einsum_corner_cases_cuda (__main__.TestLinalgCUDA) ... ok (0.013s)
Jun 17 23:20:50   test_einsum_cuda_complex128 (__main__.TestLinalgCUDA) ... ok (0.020s)
Jun 17 23:20:50   test_einsum_cuda_float64 (__main__.TestLinalgCUDA) ... ok (0.015s)
Jun 17 23:20:50   test_einsum_error_cases_cuda (__main__.TestLinalgCUDA) ... ok (0.047s)
Jun 17 23:20:50   test_einsum_random_cuda_complex128 (__main__.TestLinalgCUDA) ... free(): invalid pointer
Jun 17 23:20:51 Traceback (most recent call last):
Jun 17 23:20:51   File "/var/lib/jenkins/workspace/test/run_test.py", line 1313, in <module>
Jun 17 23:20:51     main()
Jun 17 23:20:51   File "/var/lib/jenkins/workspace/test/run_test.py", line 1292, in main
Jun 17 23:20:51     raise RuntimeError(err_message)
Jun 17 23:20:51 RuntimeError: test_linalg failed! Received signal: SIGIOT
Jun 17 23:20:51 
Jun 17 23:20:51 real	39m59.396s
Jun 17 23:20:51 user	58m58.634s
Jun 17 23:20:51 sys	39m58.629s
Jun 17 23:20:51 + cleanup
Jun 17 23:20:51 + retcode=1
Jun 17 23:20:51 + set +x
Jun 17 23:20:51 =================== sccache compilation log ===================
Jun 17 23:20:52 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 17 23:20:52 Compile requests                      0

pytorch_linux_xenial_py3_clang5_asan_test1 (5/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 17 23:49:41 RuntimeError: test_linalg failed!

Jun 17 23:49:41     #115 0x55c64891ab0d in main /tmp/build/80754af9/python_1614113050744/work/Programs/python.c:69
Jun 17 23:49:41     #116 0x7fdd75f3e83f in __libc_start_main /build/glibc-S7Ft5T/glibc-2.23/csu/../csu/libc-start.c:291
Jun 17 23:49:41     #117 0x55c6489f9d6f in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jun 17 23:49:41 
Jun 17 23:49:41 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:780:41 in 
Jun 17 23:49:41 Traceback (most recent call last):
Jun 17 23:49:41   File "test/run_test.py", line 1313, in <module>
Jun 17 23:49:41     main()
Jun 17 23:49:41   File "test/run_test.py", line 1292, in main
Jun 17 23:49:41     raise RuntimeError(err_message)
Jun 17 23:49:41 RuntimeError: test_linalg failed!
Jun 17 23:49:41 + cleanup
Jun 17 23:49:41 + retcode=1
Jun 17 23:49:41 + set +x
Jun 17 23:49:41 =================== sccache compilation log ===================
Jun 17 23:49:41 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 17 23:49:42 Compile requests                      0
Jun 17 23:49:42 Compile requests executed             0
Jun 17 23:49:42 Cache hits                            0
Jun 17 23:49:42 Cache misses                          0
Jun 17 23:49:42 Cache timeouts                        0

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test1 (6/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 17 23:39:52 RuntimeError: test_sort_and_select failed! Received signal: SIGIOT

Jun 17 23:39:52 7f88a55fb000-7f88a55fc000 rw-p 00000000 00:00 0 
Jun 17 23:39:52 7fff787b1000-7fff787d2000 rw-p 00000000 00:00 0                          [stack]
Jun 17 23:39:52 7fff787e3000-7fff787e6000 r--p 00000000 00:00 0                          [vvar]
Jun 17 23:39:52 7fff787e6000-7fff787e7000 r-xp 00000000 00:00 0                          [vdso]
Jun 17 23:39:52 ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
Jun 17 23:39:52 Traceback (most recent call last):
Jun 17 23:39:52   File "test/run_test.py", line 1313, in <module>
Jun 17 23:39:52     main()
Jun 17 23:39:52   File "test/run_test.py", line 1292, in main
Jun 17 23:39:52     raise RuntimeError(err_message)
Jun 17 23:39:52 RuntimeError: test_sort_and_select failed! Received signal: SIGIOT
Jun 17 23:39:53 + cleanup
Jun 17 23:39:53 + retcode=1
Jun 17 23:39:53 + set +x
Jun 17 23:39:53 =================== sccache compilation log ===================
Jun 17 23:39:53 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 17 23:39:53 Compile requests                      6
Jun 17 23:39:53 Compile requests executed             1
Jun 17 23:39:53 Cache hits                            1
Jun 17 23:39:53 Cache hits (C/C++)                    1
Jun 17 23:39:53 Cache misses                          0

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

pytorch_linux_bionic_py3_8_gcc9_coverage_test2 on Jun 17 from 1:00pm to 4:55pm (462448f - e2129d1)
- 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).


walterddr · 2021-06-17T20:27:55Z

tools/print_test_stats.py

-               self.name == 'cpp':  # The caffe2 cpp tests spawn duplicate test cases as well.
-                time_difference = self.test_suites[suite_name].replace(test_case)
-                self.total_time += time_difference
+            if is_multi_test:


Here we already know the tests are run exactly twice, so I think we are fine when the test file is in the list above.
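A minimal sketch of the behavior discussed in this comment, assuming a suite that keys test cases by name. `Case`, `Suite`, `add`, and the example suite and test names are hypothetical stand-ins for illustration, not the actual classes in tools/print_test_stats.py; only the `is_multi_test` flag and the `+=` accumulation come from the diff.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Case:
    # Hypothetical stand-in for a single test case record.
    name: str
    time: float


@dataclass
class Suite:
    # Hypothetical stand-in for a test suite; not the real class.
    name: str
    total_time: float = 0.0
    test_cases: Dict[str, Case] = field(default_factory=dict)

    def add(self, case: Case, is_multi_test: bool = False) -> None:
        """Record a test case; repeated names are allowed only for multi-run files."""
        if case.name in self.test_cases:
            if not is_multi_test:
                raise ValueError(
                    f"duplicate test case {case.name!r} in suite {self.name!r}"
                )
            # The same test ran again (e.g. a second backend): accumulate its time.
            self.test_cases[case.name].time += case.time
        else:
            # Store a copy so later accumulation does not mutate the caller's object.
            self.test_cases[case.name] = Case(case.name, case.time)
        self.total_time += case.time


suite = Suite("TestDistBackend")  # assumed suite name
suite.add(Case("test_all_reduce", 2.0), is_multi_test=True)
suite.add(Case("test_all_reduce", 3.0), is_multi_test=True)
print(suite.test_cases["test_all_reduce"].time)  # 5.0
```

In this sketch a duplicate name in a file that is not flagged as multi-run raises an error; whether the real tool raises, warns, or silently overwrites is not stated in the thread.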

facebook-github-bot · 2021-06-17T23:06:01Z

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

janeyx99 · 2021-06-17T23:15:09Z

tools/print_test_stats.py

-        self.total_time = self.total_time + test_case.time - old_time
-        self.test_cases[name] = test_case
-        return test_case.time - old_time
+        self.test_cases[name].time += test_case.time


So is the design decision to take the sum of the times instead of the max? What about tests that were run in parallel? (This was the preliminary reason why we did max instead of sum.)

Do you have an example of a test that ran in parallel?
From what I can see, all of those distributed tests are run sequentially; see:

pytorch/test/run_test.py, lines 716 to 721 in 59b1003:

    return_code = run_test(test_module, test_directory, options,
                           launcher_cmd=mpiexec)
else:
    return_code = run_test(test_module, test_directory, options)
if return_code != 0:
    return return_code

Ah, I think I previously misunderstood how the tests were spawned. Since test_distributed_spawn seems to just be spawned sequentially multiple times in run_test.py, summing is probably better. The downside is that there are other configurations where we might be better off taking the max (like test_cpp_extensions), but those seem to be smaller tests in general.
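To make the sum-versus-max trade-off in this thread concrete, here is a small sketch. The `aggregate` helper, the test name, and the timings are assumptions made up for illustration; they are not taken from tools/print_test_stats.py or from real CI data.

```python
from typing import Dict, Iterable, Tuple


def aggregate(runs: Iterable[Tuple[str, float]], how: str = "sum") -> Dict[str, float]:
    """Combine repeated (test_name, seconds) records into one duration per name."""
    if how not in ("sum", "max"):
        raise ValueError(f"unknown aggregation mode: {how!r}")
    combined: Dict[str, float] = {}
    for name, seconds in runs:
        if name not in combined:
            combined[name] = seconds
        elif how == "sum":
            # Sequential reruns: wall-clock cost is the total of all runs.
            combined[name] += seconds
        else:
            # Parallel reruns: wall-clock cost is bounded by the slowest run.
            combined[name] = max(combined[name], seconds)
    return combined


# Hypothetical timings for one test that ran twice.
runs = [("test_barrier", 4.0), ("test_barrier", 6.0)]
print(aggregate(runs, "sum"))  # {'test_barrier': 10.0}
print(aggregate(runs, "max"))  # {'test_barrier': 6.0}
```

If the two runs really happen one after the other, as the run_test.py excerpt suggests for the distributed tests, the max would underreport the file's cost by the duration of the shorter run, which is the kind of underestimate that can skew shard assignment.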

janeyx99

nice

facebook-github-bot · 2021-06-18T19:51:25Z

@walterddr merged this pull request in c0f8cad.

add test type into the suite name

a816015

walterddr requested review from samestep and janeyx99 June 17, 2021 20:27

facebook-github-bot added the cla signed label Jun 17, 2021

walterddr commented Jun 17, 2021

samestep reviewed Jun 17, 2021

alter logic to support multiple test runs on distributed/*

8cc1922

walterddr force-pushed the be_fix_shard_inbalance branch from 7c4b1a6 to 8cc1922 Compare June 17, 2021 21:16

fix mypy

ddb4511

janeyx99 reviewed Jun 17, 2021

janeyx99 approved these changes Jun 18, 2021

facebook-github-bot closed this in c0f8cad Jun 18, 2021

facebook-github-bot added the Merged label Jun 18, 2021

walterddr mentioned this pull request Jun 21, 2021

[CI stats] sharded test skew in test1/2 #60136

Closed
