Async host deep copy needs to be sequenced in the device queue #7261
Further comments on this.

Originally posted by @masterleinad in #7245 (comment):

If I understand correctly, the following case may cause a problem:

```cpp
using ExecSpace = Kokkos::DefaultExecutionSpace;
auto instances =
    Kokkos::Experimental::partition_space(ExecSpace(), 1, 1);
std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[0], dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[1], dst, src2, size);
});
t1.join();
t2.join();
```

To avoid a race condition, we need to make sure that the internal operations are not executed concurrently. |
What behavior would you expect? Without coordinating the two (separate) execution space instances in the code snippet, there is no way to know in which order the copies execute.

The case we are trying to support is for both threads to use the same execution space instance. In this case, we want to guarantee that the deep copies are enqueued in some order, so you wouldn't know whether `dst` ends up holding the contents of `src` or `src2`, but the result would be one or the other, not a mix.

The relevant case here is for the execution space instance not to be able to access host memory (since we enqueue otherwise anyway). The current implementation would fence the execution space instance first and then enqueue the deep copy on the default host execution space. The order of the fences doesn't matter and doesn't create a race condition. |
Sure, so we consider something like:

```cpp
using ExecSpace = Kokkos::DefaultExecutionSpace;
std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src2, size);
});
t1.join();
t2.join();
```
In order to enqueue, we need to do the following procedure: … This way we can avoid a race condition between t1 and t2? |
I'm trying to argue that the current implementation is correct and that we do what we promise. By "enqueue", I mean using the respective mechanism for a backend to make sure that kernels submitted to the same execution space instance don't run concurrently and, if submitted from the same thread, run in the order of submission. |
If we submit tasks from different threads, I guess the internal |
No, we need to make sure that multiple calls to |
In the current implementation, |
Yes, we need to fix the implementation of |
Sure. Then I may propose:

```cpp
Kokkos::DefaultHostExecutionSpace exec;
auto *internal_instance = exec.impl_internal_space_instance();
std::lock_guard<std::mutex> lock(internal_instance->m_instance_mutex);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence before copy");
hostspace_parallel_deepcopy_async(exec, dst, src, n);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence after copy");
```

If this is good for you, I will make a PR for this fix. |
Sure, you would just need to make sure that all relevant execution spaces (Serial, OpenMP, HPX, Threads) have |
Sure, I will have a look.
Internally, the procedure changes depending on the size and alignment. It may perform |
The problematic parts are kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp, lines 63 to 93 and lines 96 to 126 (at cad5bfb). |
Thank you for the comments. I will start working on this. |
After some discussions with @dalg24, I am currently trying to trigger a race condition deliberately in order to develop a good test for this issue. The problematic lines can only be executed with unaligned data, so I am using … However, in practice these lines in the while loop are not executed at the same time, since the loop finishes quite rapidly; it is the thread that executes this part last that updates the values. I can deliberately provoke a race condition by inserting |
Have a look at our other thread-safety tests. We are doing something like kokkos/core/unit_test/TestExecSpaceThreadSafety.hpp, lines 65 to 76 (at 8bc568f), only using one element in a View. You could have one thread execute such a kernel repeatedly while the other thread copies repeatedly. You will probably need to use more than one element to hit the desired path in the deep_copy, though. |
That is a good idea. I was just trying to make a race with |
As pointed out in PR #7245, the asynchronous host deep copy needs to be sequenced in the device queue.
Originally posted by @crtrott in #7245 (comment)