Async host deep copy needs to be sequenced in the device queue #7261

Open
yasahi-hpc opened this issue Aug 26, 2024 · 16 comments

Comments

@yasahi-hpc
Contributor
yasahi-hpc commented Aug 26, 2024

As pointed out in PR #7245, the asynchronous host deep copy needs to be sequenced in the device queue.

Uh oh, I think semantically we actually need to do something more complicated than this.
To be fully correct in this case I believe we need to prevent any operations on exec from overlapping with the deep_copy.
Which means we likely need to:

  • acquire exec lock
  • wait for all exec operations to finish
  • perform the deep_copy (and wait for it to finish)
  • release the exec lock

Originally posted by @crtrott in #7245 (comment)
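
A minimal sketch of that sequence, assuming a hypothetical per-instance lock exec_mutex (how such a lock is actually provided is what this issue is about) and host Views dst and src:

Kokkos::DefaultExecutionSpace exec;
std::mutex exec_mutex;  // hypothetical: stands in for whatever lock ends up protecting exec

{
  std::lock_guard<std::mutex> lock(exec_mutex);   // acquire the exec lock
  exec.fence("wait for all operations on exec");  // wait for all in-flight work on exec
  Kokkos::deep_copy(dst, src);                    // two-argument deep_copy blocks, so the copy is finished here
}                                                 // the exec lock is released on scope exit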

@yasahi-hpc
Contributor Author

Further comments on this.

We only use this branch for a device execution space instance which isn't able to access the memory in question anyway. The only problem could arise if another host execution space (different from the one used here) accesses the memory from a different thread, but we don't promise anything in that case anyway. If we happen to use the same execution space instance, we enqueue correctly.

Originally posted by @masterleinad in #7245 (comment)

If I understand correctly, the following case may cause a problem:

using ExecSpace = Kokkos::DefaultExecutionSpace;
auto instances =
    Kokkos::Experimental::partition_space(ExecSpace(), 1, 1);

std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[0], dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[1], dst, src2, size);
});
t1.join();
t2.join();

To avoid a race condition, we need to make sure that the internal operations are not executed concurrently.
Do I understand the issue correctly? @crtrott @dalg24 @masterleinad

@masterleinad
Contributor

What behavior would you expect? Without coordinating the two (separate) execution space instances in the code snippet, there is no way to know when dst is accessed by the two threads, so you would always have a race condition.

The case we are trying to support is for both threads to use the same execution space instance. In this case, we want to guarantee that the deep_copies are enqueued in some order, so you wouldn't know whether you end up with src or src2, but at least there is no race condition.

The relevant case here is for the execution space instance not to be able to access host memory (since we enqueue otherwise anyway). The current implementation would fence the execution space instance first and then enqueue the deep copy on the default host execution space. The order of the fences doesn't matter and doesn't create a race condition.
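
A minimal sketch of that call-site behavior, assuming a device instance exec that cannot access HostSpace, raw host pointers dst and src of n bytes, and the internal helper hostspace_parallel_deepcopy_async discussed below:

// Current call-site path for the "cannot access host memory" case:
exec.fence("fence the device instance before the host copy");  // order against work already submitted to exec
Kokkos::DefaultHostExecutionSpace host_exec;
hostspace_parallel_deepcopy_async(host_exec, dst, src, n);      // enqueue the copy on the default host instance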

@yasahi-hpc
Contributor Author

What behavior would you expect? Without coordinating the two (separate) execution space instances in the code snippet, there is no way to know when dst is accessed by the two threads, so you would always have a race condition.

The case we are trying to support is for both threads to use the same execution space instance. In this case, we want to guarantee that the deep_copies are enqueued in some order, so you wouldn't know whether you end up with src or src2, but at least there is no race condition.

Sure, so we consider something like

using ExecSpace = Kokkos::DefaultExecutionSpace;

std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src2, size);
});
t1.join();
t2.join();

The relevant case here is for the execution space instance not to be able to access host memory (since we enqueue otherwise anyway). The current implementation would fence the execution space instance first and then enqueue the deep copy on the default host execution space. The order of the fences doesn't matter and doesn't create a race condition.

In order to enqueue, do we need to do the following procedure in DeepCopy?

  1. std::scoped_lock the mutex behind the execspace.
  2. cudaStreamSynchronize(stream) (or equivalent)
  3. do hostspace_parallel_deepcopy_async(dst, src, n);

This way, can we avoid a race condition between t1 and t2?
I am not sure which mutex can be used for this purpose.

@masterleinad
Contributor

I'm trying to argue that the current implementation is correct and we do what we promise. When I say "enqueue", I mean using the respective mechanism for a backend to make sure that kernels submitted to the same execution space instance don't run concurrently and, if submitted from the same thread, run in the order of submission.
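
A minimal illustration of that guarantee, assuming a View a of length n: both kernels are submitted to the same instance from one thread, so the second one cannot start before the first one has finished.

Kokkos::DefaultExecutionSpace exec;
Kokkos::View<int*> a("a", n);

// Same instance, same submitting thread: executed in submission order, never concurrently.
Kokkos::parallel_for(
    "first", Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(exec, 0, n),
    KOKKOS_LAMBDA(const int i) { a(i) = 1; });
Kokkos::parallel_for(
    "second", Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(exec, 0, n),
    KOKKOS_LAMBDA(const int i) { a(i) += 1; });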

@yasahi-hpc
Contributor Author

I'm trying to argue that the current implementation is correct and we do what we promise. When I say "enqueue", I mean using the respective mechanism for a backend to make sure that kernels submitted to the same execution space instance don't run concurrently and, if submitted from the same thread, run in the order of submission.

If we submit tasks from different threads, I guess the internal hostspace_parallel_deepcopy_async can potentially run concurrently. Do you mean that nothing is promised if we submit tasks from different threads?

@masterleinad
Contributor

No, we need to make sure that multiple calls to hostspace_parallel_deepcopy_async using the same execution space instance don't run concurrently but enqueue correctly on that execution space instance.

@yasahi-hpc
Contributor Author

In the current implementation, hostspace_parallel_deepcopy_async runs on Kokkos::DefaultHostExecutionSpace. Thus, I feel hostspace_parallel_deepcopy_async can potentially run concurrently if it is called from different threads and Kokkos::DefaultHostExecutionSpace is asynchronous, like HPX.


@masterleinad
Contributor

Yes, we need to fix the implementation of hostspace_parallel_deepcopy_async but the call site is OK.

@yasahi-hpc
Contributor Author

Sure. Then I would propose the following:

  1. Keep the call site DeepCopy as is
  2. Fix the overload of hostspace_parallel_deepcopy_async
  3. Introduce a lock inside to make sure that hostspace_parallel_deepcopy_async does not run concurrently, e.g.:

Kokkos::DefaultHostExecutionSpace exec;
auto *internal_instance = exec.impl_internal_space_instance();

std::lock_guard<std::mutex> lock(internal_instance->m_instance_mutex);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence before copy");
hostspace_parallel_deepcopy_async(exec, dst, src, n);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence after copy");

If that works for you, I will make a PR for this fix.

@masterleinad
Contributor

Sure, you would just need to make sure that all relevant execution spaces (Serial, OpenMP, HPX, Threads) have m_instance_mutex. I don't know about the latter two. Also, you can't have that mutex locked when calling a parallel construct, so you shouldn't do that if the implementation just calls a single parallel construct. You would need to refactor if multiple parallel constructs are called, or if something else touches the memory (memcpy, for example) in addition to a parallel construct.
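
For illustration, a sketch of the pattern this rules out (internal_instance and exec are the names from the proposal above; dst_c, src_c, and n stand for the byte pointers and length inside the copy routine):

std::lock_guard<std::mutex> lock(internal_instance->m_instance_mutex);
// Dispatching a parallel construct while m_instance_mutex is still held is
// exactly what must not happen:
Kokkos::parallel_for(
    "copy_bulk", Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(exec, 0, n),
    [=](const ptrdiff_t i) { dst_c[i] = src_c[i]; });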

@yasahi-hpc
Contributor Author

Sure, you would just need to make sure that all relevant execution spaces (Serial, OpenMP, HPX, Threads) have m_instance_mutex.

Sure, I will have a look.

Also, you can't have that mutex locked when calling a parallel construct, so you shouldn't do that if the implementation just calls a single parallel construct. You would need to refactor if multiple parallel constructs are called, or if something else touches the memory (memcpy, for example) in addition to a parallel construct.

Internally, the procedure changes depending on size and alignment. It may perform a memcpy, a parallel_for, or byte-based copies for the non-aligned parts. For the asynchronous HPX backend, it always performs a parallel_for, so we do not need a lock in this case.

@masterleinad
Contributor

The problematic parts are

// Both src and dst are aligned the same way with respect to 8 byte words
if (reinterpret_cast<ptrdiff_t>(src) % 8 ==
    reinterpret_cast<ptrdiff_t>(dst) % 8) {
  char* dst_c = reinterpret_cast<char*>(dst);
  const char* src_c = reinterpret_cast<const char*>(src);
  int count = 0;
  // get initial bytes copied
  while (reinterpret_cast<ptrdiff_t>(dst_c) % 8 != 0) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
    count++;
  }
  // copy the bulk of the data
  double* dst_p = reinterpret_cast<double*>(dst_c);
  const double* src_p = reinterpret_cast<const double*>(src_c);
  Kokkos::parallel_for("Kokkos::Impl::host_space_deepcopy_double",
                       policy_t(exec, 0, (n - count) / 8),
                       [=](const ptrdiff_t i) { dst_p[i] = src_p[i]; });
  // get final data copied
  dst_c += ((n - count) / 8) * 8;
  src_c += ((n - count) / 8) * 8;
  char* dst_end = reinterpret_cast<char*>(dst) + n;
  while (dst_c != dst_end) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
  }
  return;

and

// Both src and dst are aligned the same way with respect to 4 byte words
if (reinterpret_cast<ptrdiff_t>(src) % 4 ==
    reinterpret_cast<ptrdiff_t>(dst) % 4) {
  char* dst_c = reinterpret_cast<char*>(dst);
  const char* src_c = reinterpret_cast<const char*>(src);
  int count = 0;
  // get initial bytes copied
  while (reinterpret_cast<ptrdiff_t>(dst_c) % 4 != 0) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
    count++;
  }
  // copy the bulk of the data
  int32_t* dst_p = reinterpret_cast<int32_t*>(dst_c);
  const int32_t* src_p = reinterpret_cast<const int32_t*>(src_c);
  Kokkos::parallel_for("Kokkos::Impl::host_space_deepcopy_int",
                       policy_t(exec, 0, (n - count) / 4),
                       [=](const ptrdiff_t i) { dst_p[i] = src_p[i]; });
  // get final data copied
  dst_c += ((n - count) / 4) * 4;
  src_c += ((n - count) / 4) * 4;
  char* dst_end = reinterpret_cast<char*>(dst) + n;
  while (dst_c != dst_end) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
  }
  return;

@yasahi-hpc
Contributor Author

Thank you for the comments. I will start working on this.

@yasahi-hpc
Contributor Author

After some discussions with @dalg24, I am currently trying to trigger a race condition deliberately in order to develop a good test for this issue. The problematic lines can only be executed with unaligned data, so I am using a View of short.

However, these lines in the while loop are in practice not executed at the same time, since they finish quite rapidly; it is the thread that executes this part last that updates the values. I can deliberately cause a race condition by inserting a sleep at line 104, for example, but that does not make sense. Is there an alternative way to force a race condition between the two threads?

@masterleinad
Contributor

Have a look at our other thread-safety tests. We are doing something like

auto lambda = [=]() {
  TEST_EXECSPACE exec;
  for (int j = 0; j < M; ++j) {
    Kokkos::parallel_for(
        Kokkos::RangePolicy<TEST_EXECSPACE>(exec, 0, 1), KOKKOS_LAMBDA(int) {
          Kokkos::atomic_store(view.data(), 0);
          for (int i = 0; i < N; ++i) Kokkos::atomic_inc(view.data());
          if (Kokkos::atomic_load(view.data()) != N)
            Kokkos::atomic_store(error.data(), 1);
        });
  }
};

only using one element in a View.

You could have one thread execute such a kernel repeatedly and the other thread would copy repeatedly. You will probably need to use more than one element to hit the desired path in the deep_copy, though.
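
A sketch of that two-thread structure, reusing error, N, M, TEST_EXECSPACE, and lambda from the snippet above, and assuming view is a Kokkos::View<int*, TEST_EXECSPACE> with src an additional source View of the same type and length:

// Thread 1 runs the checking kernel, thread 2 repeatedly deep-copies into the
// same View using an instance of the same execution space. With correct
// enqueueing the kernel and the copy never overlap, so error stays 0.
Kokkos::View<int*, TEST_EXECSPACE> src("src", view.extent(0));

std::thread t1(lambda);
std::thread t2([=]() {
  TEST_EXECSPACE exec;
  for (int j = 0; j < M; ++j) Kokkos::deep_copy(exec, view, src);
});
t1.join();
t2.join();
Kokkos::fence();

Whether the unaligned host-copy path is actually hit still depends on the element type and extent, as discussed above.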

@yasahi-hpc
Contributor Author

That is a good idea. I was just trying to trigger a race with deep_copies from two threads, but a race condition was never observed.
