Async host deep copy needs to be sequenced in the device queue #7261

Open
yasahi-hpc opened this issue Aug 26, 2024 · 16 comments

Comments

@yasahi-hpc
Contributor
yasahi-hpc commented Aug 26, 2024

As pointed out in PR #7245, the asynchronous host deep copy needs to be sequenced in the device queue.

Uh oh, I think semantically we actually need to do something more complicated than this.
To be fully correct in this case I believe we need to prevent any operations on exec from overlapping with the deep_copy.
Which means we likely need to:

  • acquire exec lock
  • wait for all exec operations to finish
  • perform the deep_copy (and wait for it to finish)
  • release the exec lock

Originally posted by @crtrott in #7245 (comment)
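
A minimal sketch of that sequence, assuming a hypothetical per-instance lock exec_mutex (how such a lock is actually provided is what this issue is about) and host Views dst and src:

Kokkos::DefaultExecutionSpace exec;
std::mutex exec_mutex;  // hypothetical: stands in for whatever lock ends up protecting exec

{
  std::lock_guard<std::mutex> lock(exec_mutex);   // acquire the exec lock
  exec.fence("wait for all operations on exec");  // wait for all in-flight work on exec
  Kokkos::deep_copy(dst, src);                    // two-argument deep_copy blocks, so the copy is finished here
}                                                 // the exec lock is released on scope exit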

@yasahi-hpc
Contributor Author

Further comments on this.

We only use this branch for a device execution space instance which isn't able to access the memory in question anyway. The only problem could arise if another host execution space (different from the one used here) accesses the memory from a different thread, but we don't promise anything in that case anyway. If we happen to use the same execution space instance, we enqueue correctly.

Originally posted by @masterleinad in #7245 (comment)

If I understand correctly, the following case may cause a problem:

using ExecSpace = Kokkos::DefaultExecutionSpace;
auto instances =
    Kokkos::Experimental::partition_space(ExecSpace(), 1, 1);

std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[0], dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(instances[1], dst, src2, size);
});
t1.join();
t2.join();

To avoid a race condition, we need to make sure that the internal operations are not executed concurrently.
Do I understand the issue correctly? @crtrott @dalg24 @masterleinad

@masterleinad
Contributor

What behavior would you expect? Without coordinating the two (separate) execution space instances in the code snippet, there is no way to know when dst is accessed by the two threads, so you would always have a race condition.

The case we are trying to support is for both threads to use the same execution space instance. In this case, we want to guarantee that the deep_copies are enqueued in some order, so you wouldn't know whether you end up with src or src2, but at least there is no race condition.

The relevant case here is for the execution space instance not to be able to access host memory (since we enqueue otherwise anyway). The current implementation would fence the execution space instance first and then enqueue the deep copy on the default host execution space. The order of the fences doesn't matter and doesn't create a race condition.
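
A minimal sketch of that call-site behavior, assuming a device instance exec that cannot access HostSpace, raw host pointers dst and src of n bytes, and the internal helper hostspace_parallel_deepcopy_async discussed below:

// Current call-site path for the "cannot access host memory" case:
exec.fence("fence the device instance before the host copy");  // order against work already submitted to exec
Kokkos::DefaultHostExecutionSpace host_exec;
hostspace_parallel_deepcopy_async(host_exec, dst, src, n);      // enqueue the copy on the default host instance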

@yasahi-hpc
Contributor Author

What behavior would you expect? Without coordinating the two (separate) execution space instances in the code snippet, there is no way to know when dst is accessed by the two threads, so you would always have a race condition.

The case we are trying to support is for both threads to use the same execution space instance. In this case, we want to guarantee that the deep_copies are enqueued in some order, so you wouldn't know whether you end up with src or src2, but at least there is no race condition.

Sure, so we consider something like

using ExecSpace = Kokkos::DefaultExecutionSpace;

std::thread t1([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src, size);
});
std::thread t2([&]() {
  DeepCopy<HostSpace, HostSpace, ExecSpace>(ExecSpace(), dst, src2, size);
});
t1.join();
t2.join();

The relevant case here is for the execution space instance not to be able to access host memory (since we enqueue otherwise anyway). The current implementation would fence the execution space instance first and then enqueue the deep copy on the default host execution space. The order of the fences doesn't matter and doesn't create a race condition.

In order to enqueue, do we need to do the following procedure in DeepCopy?

  1. std::scoped_lock the mutex behind the execspace.
  2. cudaStreamSynchronize(stream) (or equivalent)
  3. do hostspace_parallel_deepcopy_async(dst, src, n);

This way, can we avoid a race condition between t1 and t2?
I am not sure which mutex can be used for this purpose.

@masterleinad
Contributor

I'm trying to argue that the current implementation is correct and we do what we promise. When I say "enqueue", I mean using the respective mechanism for a backend to make sure that kernels submitted to the same execution space instance don't run concurrently and, if submitted from the same thread, run in the order of submission.
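
A minimal illustration of that guarantee, assuming a View a of length n: both kernels are submitted to the same instance from one thread, so the second one cannot start before the first one has finished.

Kokkos::DefaultExecutionSpace exec;
Kokkos::View<int*> a("a", n);

// Same instance, same submitting thread: executed in submission order, never concurrently.
Kokkos::parallel_for(
    "first", Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(exec, 0, n),
    KOKKOS_LAMBDA(const int i) { a(i) = 1; });
Kokkos::parallel_for(
    "second", Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>(exec, 0, n),
    KOKKOS_LAMBDA(const int i) { a(i) += 1; });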

@yasahi-hpc
Contributor Author

I'm trying to argue that the current implementation is correct and we do what we promise. When I say "enqueue", I mean using the respective mechanism for a backend to make sure that kernels submitted to the same execution space instance don't run concurrently and, if submitted from the same thread, run in the order of submission.

If we submit tasks from different threads, I guess the internal hostspace_parallel_deepcopy_async can potentially run concurrently. Do you mean that nothing is promised if we submit tasks from different threads?

@masterleinad
Contributor

No, we need to make sure that multiple calls to hostspace_parallel_deepcopy_async using the same execution space instance don't run concurrently but enqueue correctly on that execution space instance.

@yasahi-hpc
Contributor Author

In the current implementation, hostspace_parallel_deepcopy_async runs on Kokkos::DefaultHostExecutionSpace. Thus, I feel hostspace_parallel_deepcopy_async can potentially run concurrently if it is called from different threads and Kokkos::DefaultHostExecutionSpace is asynchronous, like HPX.


@masterleinad
Contributor

Yes, we need to fix the implementation of hostspace_parallel_deepcopy_async but the call site is OK.

@yasahi-hpc
Contributor Author

Sure. Then I would propose the following:

  1. Keep the call site DeepCopy as is
  2. Fix the overload of hostspace_parallel_deepcopy_async
  3. Introduce a lock inside to make sure that hostspace_parallel_deepcopy_async does not run concurrently, e.g.:

Kokkos::DefaultHostExecutionSpace exec;
auto *internal_instance = exec.impl_internal_space_instance();

std::lock_guard<std::mutex> lock(internal_instance->m_instance_mutex);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence before copy");
hostspace_parallel_deepcopy_async(exec, dst, src, n);
internal_instance->fence(
    "Kokkos::Impl::hostspace_parallel_deepcopy_async: fence after copy");

If that works for you, I will make a PR for this fix.

@masterleinad
Contributor

Sure, you would just need to make sure that all relevant execution spaces (Serial, OpenMP, HPX, Threads) have m_instance_mutex. I don't know about the latter two. Also, you can't have that mutex locked when calling a parallel construct, so you shouldn't do that if the implementation just calls a single parallel construct. You would need to refactor if multiple parallel constructs are called, or if something else touches the memory (memcpy, for example) in addition to a parallel construct.
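
For illustration, a sketch of the pattern this rules out (internal_instance and exec are the names from the proposal above; dst_c, src_c, and n stand for the byte pointers and length inside the copy routine):

std::lock_guard<std::mutex> lock(internal_instance->m_instance_mutex);
// Dispatching a parallel construct while m_instance_mutex is still held is
// exactly what must not happen:
Kokkos::parallel_for(
    "copy_bulk", Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(exec, 0, n),
    [=](const ptrdiff_t i) { dst_c[i] = src_c[i]; });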

@yasahi-hpc
Contributor Author

Sure, you would just need to make sure that all relevant execution spaces (Serial, OpenMP, HPX, Threads) have m_instance_mutex.

Sure, I will have a look.

Also, you can't have that mutex locked when calling a parallel construct, so you shouldn't do that if the implementation just calls a single parallel construct. You would need to refactor if multiple parallel constructs are called, or if something else touches the memory (memcpy, for example) in addition to a parallel construct.

Internally, the procedure changes depending on size and alignment. It may perform a memcpy, a parallel_for, or byte-based copies for the non-aligned parts. For the asynchronous HPX backend, it always performs a parallel_for, so we do not need a lock in this case.

@masterleinad
Contributor

The problematic parts are

// Both src and dst are aligned the same way with respect to 8 byte words
if (reinterpret_cast<ptrdiff_t>(src) % 8 ==
    reinterpret_cast<ptrdiff_t>(dst) % 8) {
  char* dst_c = reinterpret_cast<char*>(dst);
  const char* src_c = reinterpret_cast<const char*>(src);
  int count = 0;
  // get initial bytes copied
  while (reinterpret_cast<ptrdiff_t>(dst_c) % 8 != 0) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
    count++;
  }
  // copy the bulk of the data
  double* dst_p = reinterpret_cast<double*>(dst_c);
  const double* src_p = reinterpret_cast<const double*>(src_c);
  Kokkos::parallel_for("Kokkos::Impl::host_space_deepcopy_double",
                       policy_t(exec, 0, (n - count) / 8),
                       [=](const ptrdiff_t i) { dst_p[i] = src_p[i]; });
  // get final data copied
  dst_c += ((n - count) / 8) * 8;
  src_c += ((n - count) / 8) * 8;
  char* dst_end = reinterpret_cast<char*>(dst) + n;
  while (dst_c != dst_end) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
  }
  return;

and

// Both src and dst are aligned the same way with respect to 4 byte words
if (reinterpret_cast<ptrdiff_t>(src) % 4 ==
    reinterpret_cast<ptrdiff_t>(dst) % 4) {
  char* dst_c = reinterpret_cast<char*>(dst);
  const char* src_c = reinterpret_cast<const char*>(src);
  int count = 0;
  // get initial bytes copied
  while (reinterpret_cast<ptrdiff_t>(dst_c) % 4 != 0) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
    count++;
  }
  // copy the bulk of the data
  int32_t* dst_p = reinterpret_cast<int32_t*>(dst_c);
  const int32_t* src_p = reinterpret_cast<const int32_t*>(src_c);
  Kokkos::parallel_for("Kokkos::Impl::host_space_deepcopy_int",
                       policy_t(exec, 0, (n - count) / 4),
                       [=](const ptrdiff_t i) { dst_p[i] = src_p[i]; });
  // get final data copied
  dst_c += ((n - count) / 4) * 4;
  src_c += ((n - count) / 4) * 4;
  char* dst_end = reinterpret_cast<char*>(dst) + n;
  while (dst_c != dst_end) {
    *dst_c = *src_c;
    dst_c++;
    src_c++;
  }
  return;

@yasahi-hpc
Contributor Author

Thank you for the comments. I will start working on this.

@yasahi-hpc
Contributor Author

After some discussions with @dalg24, I am currently trying to trigger a race condition deliberately in order to develop a good test for this issue. The problematic lines can only be executed with unaligned data, so I am using a View of short.

However, these lines in the while loop are in practice not executed at the same time, since they finish quite rapidly; it is the thread that executes this part last that updates the values. I can deliberately cause a race condition by inserting a sleep at line 104, for example, but that does not make sense. Is there an alternative way to force a race condition between the two threads?

@masterleinad
Contributor

Have a look at our other thread-safety tests. We are doing something like

auto lambda = [=]() {
  TEST_EXECSPACE exec;
  for (int j = 0; j < M; ++j) {
    Kokkos::parallel_for(
        Kokkos::RangePolicy<TEST_EXECSPACE>(exec, 0, 1), KOKKOS_LAMBDA(int) {
          Kokkos::atomic_store(view.data(), 0);
          for (int i = 0; i < N; ++i) Kokkos::atomic_inc(view.data());
          if (Kokkos::atomic_load(view.data()) != N)
            Kokkos::atomic_store(error.data(), 1);
        });
  }
};

only using one element in a View.

You could have one thread execute such a kernel repeatedly and the other thread would copy repeatedly. You will probably need to use more than one element to hit the desired path in the deep_copy, though.
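
A sketch of that two-thread structure, reusing error, N, M, TEST_EXECSPACE, and lambda from the snippet above, and assuming view is a Kokkos::View<int*, TEST_EXECSPACE> with src an additional source View of the same type and length:

// Thread 1 runs the checking kernel, thread 2 repeatedly deep-copies into the
// same View using an instance of the same execution space. With correct
// enqueueing the kernel and the copy never overlap, so error stays 0.
Kokkos::View<int*, TEST_EXECSPACE> src("src", view.extent(0));

std::thread t1(lambda);
std::thread t2([=]() {
  TEST_EXECSPACE exec;
  for (int j = 0; j < M; ++j) Kokkos::deep_copy(exec, view, src);
});
t1.join();
t2.join();
Kokkos::fence();

Whether the unaligned host-copy path is actually hit still depends on the element type and extent, as discussed above.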

@yasahi-hpc
Contributor Author

That is a good idea. I was just trying to trigger a race with deep_copies from two threads, but a race condition was never observed.
