
1 Introduction

The use of GPUs for scientific applications is on the rise. The high levels of parallelism of GPU architectures offer impressive performance and naturally fit the domain. When GPUs are incorporated into MPI programs, the de-facto standard for programming clusters, making such applications resilient becomes even more challenging than before.

As the architecture of GPUs is based on the idea of CPU-managed, non-interruptible kernel executions, current checkpointing practice assumes that all data has been taken off the GPUs and that kernel execution has finished. This restriction is not a major problem as long as applications only have short-running kernels, all of which are launched synchronously. However, as kernels become more complex, data is increasingly often kept on the GPU only, and multiple kernels are launched asynchronously. Typical cluster designs provide fewer GPUs than CPU cores per node. Consequently, MPI+CUDA applications usually require several MPI processes to share GPUs, which increases their utilisation. Such high GPU utilisation makes it more challenging to identify or create application states where all relevant data is on the host and no kernel is running. If these states occur less frequently than the Mean Time Between Failures (MTBF) of the given hardware, resilience becomes problematic.

In this paper we propose a novel approach to deal with this challenge. We enable checkpointing of MPI+CUDA applications in a way that allows snapshots to be taken in states where kernels are only partially finished and where snapshot-relevant data still resides on the GPUs of the system. We achieve this by extending the MPI checkpointing library Fault Tolerance Interface (FTI) [2] with a mechanism for soft interrupts of GPU kernels proposed in [1]. The individual contributions of the paper are:

  • we extend FTI to enable data on GPUs to be part of checkpoints;

  • we extend FTI to mark kernels so that checkpoints can be performed before those kernels are completed;

  • we demonstrate the practical applicability of the proposed approach on a given MPI+CUDA implementation of the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH) application [9];

  • we provide some indicative performance evaluations quantifying the effects of the proposed extensions on the LULESH MPI+CUDA application.

2 Interrupting a Kernel

Checkpointing in the presence of GPU kernels raises two problems: saving and restoring the GPU context, and interrupting a long-running kernel. The CUDA runtime implicitly creates an underlying context for communication between the host process and the device. Once created, the context remains attached to the host process for its lifetime. If a process is checkpointed with an active context, a restart from that checkpoint will fail because the restored context is invalid. Since FTI does not preserve process state, this work is not concerned with the context save/restore problem. CUDA also does not facilitate the interruption of a running GPU thread. Nevertheless, threads can be instrumented to interrupt themselves: the first step of a thread's execution is to check a host-controlled flag for permission either to continue or to return immediately. At runtime, CUDA threads are partitioned into groups called blocks, so a boolean array is used to keep track of executed blocks and is examined after the kernel returns. If all blocks have executed, the kernel is complete; otherwise, the kernel is relaunched. Figure 1 illustrates how a kernel is transformed into an interruptible one; a corresponding sketch is given below. For more details refer to [1]. Note that this approach does not work for kernels with explicit inter-block synchronisation.

Fig. 1. How to apply the technique to an original application.
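The following is a minimal sketch of this instrumentation, assuming a simple one-dimensional kernel; the flag and bookkeeping names, and the detail of sampling the flag once per block, are illustrative rather than taken from [1] or FTI.

    // Sketch of a soft-interruptible kernel (names illustrative).
    // 'quit' is a host-controlled flag in mapped or managed memory;
    // 'block_done' records which blocks have executed across (re)launches.
    __global__ void saxpy_interruptible(volatile bool *quit, bool *block_done,
                                        int n, float a, const float *x, float *y)
    {
        __shared__ bool stop;

        // One thread samples the flag so the whole block takes the same decision.
        if (threadIdx.x == 0) {
            stop = *quit || block_done[blockIdx.x];
        }
        __syncthreads();
        if (stop) {
            return;   // interrupted, or this block already ran in an earlier launch
        }

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];   // original kernel body
        }

        // Record that this block has executed so that a relaunch skips it.
        if (threadIdx.x == 0) {
            block_done[blockIdx.x] = true;
        }
    }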

3 FTI

FTI is a multilevel checkpointing library for large-scale supercomputers. At extreme scale, supercomputers suffer from frequent failures due to their increased number of components. As scientific applications grow in scale, they become more prone to failures that force a restart of the execution. At the same time, they also use more data, so the state to be saved at each checkpoint grows as well. This leads to an I/O bottleneck that can render scientific applications unable to make progress. To alleviate this problem, FTI makes use of multiple storage levels, including the global parallel file system as well as local storage inside the compute nodes. In particular, FTI provides four levels of checkpointing, offering a good trade-off between resilience and performance.

All the complexity of erasure coding, asynchronous transfer and managing multiple storage levels is hidden by FTI behind a simple interface that can be summarized in only four functions:

  • FTI_Init: This function initializes FTI with the configuration provided by the user in the configuration file.

  • FTI_Protect: This function tells FTI which variables need to be checkpointed.

  • FTI_Snapshot: This function takes a checkpoint according to the frequency specified in the configuration file.

  • FTI_Finalize: This function frees the memory and cleans up the different storage levels.

For most MPI applications, it suffices to insert calls to these FTI functions in order to render unprotected codes resilient against the majority of possible faults. Most scientific applications have one or more long-running iterative computations at their core in which over 90% of the runtime is spent. These loops typically recompute the values of several key data structures until either a pre-determined number of iterations or a certain degree of data stability is reached. The LULESH application that we use as our case study throughout this paper is no different. Consequently, adding resilience using FTI can be achieved by means of a few added function calls. The full FTI-enabled code can be found at https://github.com/maxbaird/luleshMultiGPU_MPI/blob/integrating-fti/lulesh.cu. The core structure of that code looks like this:

[Listing a: core structure of the FTI-instrumented LULESH code, not reproduced here]
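As the listing is not reproduced here, the following sketch reconstructs its core structure from the description in the following paragraph; apart from the FTI calls, the iteration counter its and the array elemBC.raw(), all names and details are assumptions.

    #include <mpi.h>
    #include <fti.h>

    // Sketch of the FTI-instrumented core structure (reconstructed, not verbatim).
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        FTI_Init("config.fti", MPI_COMM_WORLD);   // also detects restarts;
                                                  // ranks use FTI_COMM_WORLD afterwards
        int its = 0;
        int max_its = 1000;                       // illustrative iteration bound
        // ... allocate and initialise the simulation state (elemBC, numElem, ...) ...

        // Register all data that must survive a failure.
        FTI_Protect(0, &its, 1, FTI_INTG);
        FTI_Protect(1, elemBC.raw(), numElem, FTI_INTG);
        // ... further FTI_Protect calls for the remaining arrays ...

        while (its < max_its) {
            FTI_Snapshot();                       // checkpoints if the interval has elapsed
            // ... launch the LULESH kernels for this time step ...
            its++;
        }

        FTI_Finalize();
        MPI_Finalize();
        return 0;
    }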

The calls to FTI_Protect inform FTI which data needs to be checkpointed. Here, we only show the protection of the iteration variable its as well as one of the data-carrying arrays, elemBC.raw(). Within the main computational loop, FTI_Snapshot is called. It globally synchronises all MPI ranks to establish a global time and, provided the checkpoint interval has elapsed, triggers the actual checkpointing operation.

In case a failure happens, the same program is simply started again. However, during the execution of FTI_Init the library notices that this is actually a restart, and the data is restored to the latest checkpoint values, effectively skipping the loop iterations that had been completed before the fault occurred. More details on FTI can be found in [2].

4 Extending FTI

4.1 Checkpointing GPU Data

The first extension enables the checkpointing of data that resides on the GPU rather than on the CPU. This extension does not require any new FTI functions; instead, it suffices to extend the functionality of FTI_Protect.

FTI_Protect obtains a pointer to the data to be saved whenever a snapshot is taken. Handling data residing on the GPU therefore requires determining whether such a pointer is a valid host or device pointer. Conveniently, the CUDA API provides the cudaPointerGetAttributes function, which makes it possible to distinguish host and device pointers. For device pointers, a device-to-host transfer is made prior to taking a snapshot; correspondingly, on restart the data is copied back to the device.

Things get slightly more involved due to the variety of memory models that CUDA supports. Unified Virtual Addressing (UVA) and Unified Memory (UM), introduced in CUDA versions 4.0 and 6.0 respectively, present the programmer with a coherent view of host and device memory [12]. In those cases, explicit transfers are not required. Our extension to FTI_Protect reflects this through further pointer attribute inspections, as sketched below.
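A minimal sketch of such an inspection, assuming CUDA 10 or later; this helper is illustrative and not FTI's actual implementation.

    #include <cuda_runtime.h>
    #include <stdbool.h>

    // Decide whether a protected pointer requires an explicit device-to-host
    // copy before checkpointing (illustrative helper, not FTI's actual code).
    static bool needs_device_copy(const void *ptr)
    {
        struct cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
            cudaGetLastError();          // ordinary host pointer on older runtimes
            return false;
        }
        // Host and managed (UM) memory are directly accessible from the host;
        // only plain device allocations need an explicit cudaMemcpy.
        return attr.type == cudaMemoryTypeDevice;
    }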

4.2 Adding Kernel Suspension to FTI

Our second extension adds the ability to perform checkpoints during kernel execution. This constitutes the main technical contribution.

The key challenge here is that the underlying concept of FTI_Snapshot cannot easily be extended for use within GPU kernels. Compared to MPI ranks, GPU kernels exhibit several orders of magnitude higher levels of parallelism: executions with millions or billions of threads on a single GPU are the norm rather than the exception. Running the equivalent of FTI_Snapshot as part of such a massively parallel kernel would introduce massive overheads due to the increased synchronisation and the need to transfer control back to the host. Therefore, we execute FTI_Snapshot on the host, asynchronously to the kernel executions on the GPU. In case a snapshot needs to be performed, we use the technique for soft interrupts of GPU kernels described in [1] to stop the current kernel and to initiate the snapshot process, which is performed on the host.

To achieve this with a suitably simple interface extension of FTI, we add three new API functions:

  1. FTI_Protect_Kernel: replaces the normal kernel launch;

  2. FTI_Kernel_Def: wraps around the kernel header; and

  3. FTI_Continue: needs to be inserted at the beginning of each protected kernel.

FTI_Protect_Kernel is responsible for the host-side code that manages the kernel launch. It triggers the initial kernel launch, potentially issues an interrupt from the host followed by the execution of a snapshot, and repeats this activity until the kernel has completed.

FTI_Kernel_Def rewrites the kernel's definition to add the extra parameters that handle the soft interrupts. Lastly, FTI_Continue adds the code needed inside the kernel to enable soft interrupts.

For the LULESH example, this means that we replace kernel invocations such as

[Listing b: original kernel invocation, not reproduced here]

by a call

[Listing c: kernel invocation wrapped in FTI_Protect_Kernel, not reproduced here]
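Since the listings are not reproduced here, the following illustrates the flavour of this transformation; the kernel name, its arguments, and the three extra parameters (assumed here to be a kernel identifier, a host-side wait quantum, and a completion flag) are assumptions rather than the actual API, which can be found in the linked repository.

    // Illustrative only: all names and the extra parameters are assumed.
    // Original launch of a LULESH kernel:
    someKernel<<<dimGrid, dimBlock>>>(numElem, domainData);

    // Protected launch: the same kernel wrapped by FTI_Protect_Kernel:
    FTI_Protect_Kernel(KERNEL_ID, delta_t, &complete,
                       someKernel, dimGrid, dimBlock, numElem, domainData);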

From the user’s perspective this is merely the addition of a wrapping function call with three additional parameters. Our complete version of LULESH with kernel protection can be found at https://github.com/maxbaird/luleshMultiGPU_MPI/blob/integrating-fti-protecting-kernels/lulesh.cu.

4.3 Implementing FTI_Protect_Kernel

Roughly, FTI_Protect_Kernel translates into the following pseudo code:

[Listing d: pseudo code of FTI_Protect_Kernel, not reproduced here]
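As the listing is not reproduced here, the following pseudo code is reconstructed from the description below; FTI_kernel_init, FTI_wait, FTI_Snapshot and FTIT_KernelInfo are named in the text, all other helpers are illustrative placeholders.

    // Pseudo code for FTI_Protect_Kernel (reconstruction, not the actual source).
    FTIT_KernelInfo *info = FTI_kernel_init(kernel_id);  // reused or restored per kernel

    bool all_ranks_complete = false;
    while (!all_ranks_complete) {
        // (Re)launch the kernel asynchronously; blocks recorded as executed
        // return immediately due to the soft-interrupt instrumentation.
        launch_kernel_async(info);

        FTI_wait(info, delta_t);       // poll completion every 0.5 ms, at most delta_t
        request_interrupt(info);       // set the host-controlled flag
        wait_for_kernel_return();      // e.g. cudaDeviceSynchronize()

        FTI_Snapshot();                // checkpoints if the interval has elapsed

        // Keep relaunching until every MPI rank reports a completed kernel, so
        // that all snapshots stem from the same FTI_Snapshot call.
        all_ranks_complete = exchange_completion_status(info);   // MPI_Allgather
    }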

FTI_Protect_Kernel first calls FTI_kernel_init, which initialises an object of type FTIT_KernelInfo with the information needed to interrupt, checkpoint, and restart the kernel. For efficiency, a kernel's metadata is initialised once and cleared and reused when the kernel with the same ID is launched again. The initialisation call is made irrespective of whether the application is executing normally or recovering from a failure; if a kernel already has associated metadata, that metadata is restored.

After initialisation, the kernel launch is placed in a loop that only terminates after the kernel has been completed by all MPI processes. This ensures consistency, as it guarantees that all snapshot images stem from the same call to FTI_Snapshot across all MPI ranks. Once the kernel has been launched asynchronously, the host waits for some period \(\delta _t\) (delta_t) before stopping the kernel and invoking FTI_Snapshot. The choice of \(\delta _t\) is tricky. The smaller \(\delta _t\) is, the finer the granularity at which protected kernels can be stopped. While this is desirable, it comes at a price: whenever we invoke FTI_Snapshot, we synchronise across all MPI ranks, which introduces noticeable overhead. On the other hand, choosing a large \(\delta _t\) could mean that (a) we heavily overrun our checkpointing interval or (b) the host waits idly even though the kernel has already terminated. The former can be avoided by choosing \(\delta _t\) as a sufficiently small fraction of the checkpoint interval. We avoid the latter by implementing the waiting through an internal function FTI_wait, which polls the GPU for the kernel's completion status every 0.5 ms. Finally, after FTI_Snapshot has terminated, a call to MPI_Allgather ensures that all ranks know about the completion status of all other MPI ranks.
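A minimal sketch of such a polling wait, assuming the kernel was launched on a dedicated stream; the name and signature are illustrative, not FTI's internal FTI_wait.

    #include <cuda_runtime.h>
    #include <stdbool.h>
    #include <unistd.h>

    // Wait up to delta_t seconds, returning early once the kernel's stream has
    // drained; polls every 0.5 ms (illustrative sketch).
    static bool wait_for_kernel(cudaStream_t stream, double delta_t)
    {
        double waited = 0.0;
        while (waited < delta_t) {
            if (cudaStreamQuery(stream) == cudaSuccess) {
                return true;           // kernel has already finished
            }
            usleep(500);               // 0.5 ms polling interval
            waited += 0.0005;
        }
        return false;                  // delta_t exhausted, kernel still running
    }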

4.4 What Happens at Checkpoint Time

In addition to the data that has been declared for protection by the user, FTI also saves metadata that contains information about the state of the execution. For GPU kernels, this entails information about the degree of kernel completion, which needs to be transferred from the GPU to the CPU and stored alongside the checkpoint in the standard FTI way.
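Purely as an illustration of what this per-kernel metadata might contain; the actual FTIT_KernelInfo layout is not shown here, so all field names are assumptions.

    #include <stdbool.h>
    #include <stddef.h>

    // Hypothetical sketch of per-kernel checkpoint metadata.
    typedef struct {
        int            id;             // kernel identifier passed to FTI_kernel_init
        bool           complete;       // all blocks of this kernel have executed
        size_t         block_count;    // number of blocks in the launch configuration
        bool          *h_block_done;   // host copy, written into the checkpoint
        bool          *d_block_done;   // device copy, consulted by the kernel itself
        volatile bool *quit;           // host-controlled soft-interrupt flag
        bool          *all_done;       // per-rank completion status (MPI_Allgather)
    } KernelInfoSketch;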

4.5 What Happens at Restart Time

A previously failed application may be restored if at least one checkpoint was successful prior to the failure. When executed, FTI detects the execution as a restart and tries to recover the most recent checkpoint data. The metadata corresponding to the recovered checkpoint is also loaded as part of the restart. The initialisation phase of FTI triggers the restart process and subsequently calls a setup function for kernel protection. If the setup function detects that the application is in recovery mode, it attempts to load the metadata for all protected kernels. For an interruptible kernel, FTI_Protect_Kernel will rewrite the kernel launch as described; this time, however, the kernel's associated metadata will be restored instead of newly allocated. The restored metadata contains information about the kernel execution state, which the kernel uses to resume accurately.

A previously completed kernel will have its metadata reset so that it can be launched again. However, if there are multiple kernels to be restored, a check is first performed to ensure that all protected kernels are complete, because at this point it cannot be determined whether a kernel is being relaunched immediately after the failure or again through a later loop iteration. In the former case, execution must resume from the incomplete kernel; in the latter case, completed kernels that were not reset will still launch but do nothing, since all of their blocks are marked as complete.
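The following sketch captures this decision; it is illustrative only and assumes the hypothetical metadata layout sketched in Sect. 4.4.

    #include <string.h>

    // Reset completed kernels only if every protected kernel is complete
    // (illustrative, not the actual FTI implementation).
    static void restore_kernel_metadata(KernelInfoSketch *kernels, size_t n)
    {
        bool all_complete = true;
        for (size_t i = 0; i < n; i++) {
            if (!kernels[i].complete) {
                all_complete = false;          // execution resumes from this kernel
                break;
            }
        }
        if (all_complete) {
            for (size_t i = 0; i < n; i++) {   // allow all kernels to run again
                kernels[i].complete = false;
                memset(kernels[i].h_block_done, 0,
                       kernels[i].block_count * sizeof(bool));
            }
        }
        // Otherwise completed kernels stay untouched: relaunching them is a
        // no-op because all of their blocks are marked as executed.
    }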

5 Experimental Setup

From LULESH we used a kernel that is called once in each iteration of its main loop and sufficiently oversubscribes the GPU. For our experiments, only level 1 checkpoints were permitted. The other levels were effectively disabled by configuring their intervals to be greater than the longest-running experiment; this was done because the time taken for checkpointing is not consistent across levels. Interruptions were simulated by prematurely terminating the application (via Ctrl+C) during its execution, after a minimum of one successful checkpoint. The amount of data captured at each checkpoint varies with the application's input; for our experiments the checkpoint data size ranged from 500 MB to 1 GB.
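A configuration along the following lines achieves this; the section and key names follow FTI's example configuration file, but their exact spelling and the values shown are assumptions.

    ; Sketch of the relevant part of config.fti (checkpoint intervals in minutes).
    [basic]
    ckpt_l1 = 4        ; level-1 checkpoints only
    ckpt_l2 = 100000   ; levels 2-4 set far beyond the longest experiment,
    ckpt_l3 = 100000   ; effectively disabling them
    ckpt_l4 = 100000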

All experiments were executed on a system with an AMD Opteron 6376 CPU running Scientific Linux release 7.6 (Nitrogen), kernel version 3.10.0. The system has 1024 GB of RAM and an NVIDIA TITAN Xp GPU with 12 GB of global memory connected via PCIe x16. For our experiments, CUDA version 10.0 was used with driver version 410.79.

6 Case Study

In this section we examine the practical impact of our extension on the LULESH use case. To this end, we have run experiments to examine three effects: the first experiment establishes whether we can stop a kernel prematurely, the second demonstrates how much more interruptibility our proposed approach enables, and the final one looks at the incurred overhead.

Restarting after the Checkpoint. Our first experiment is a sanity check for the extended FTI. We verify that when protecting a single kernel of our test application, which includes snapshotting data residing only on the GPU, we can use the saved data to successfully restart. We verified that the modified application restarted successfully, and that the result it computes is identical to the one computed by the original application. We also verified that the snapshot happened before the kernel completed, and that the GPU data was actually stored in the snapshot. This raises our confidence that the proposed implementation works as expected.

Counting Snapshots. Our second experiment is concerned with the change in the minimal snapshotting interval. We verify this change by counting the number of snapshots that we can take after the kernel has been protected. If the minimal snapshotting interval is determined by the runtime of the kernel, then the factor by which we decrease that interval is exactly the factor by which the number of taken snapshots increases. This experiment consists of two parts. In Fig. 2a we explore the limit case: how many more snapshots we could possibly take after protecting one kernel. In Fig. 2b we investigate a more realistic scenario when running the resilient version of our application.

Fig. 2. Reducing checkpoint interval

Let us assume a simplistic execution model of the application with one protected kernel, one MPI process, and only synchronous kernel launches. In this case, the snapshot count is determined by the number of kernel interrupts we can make. The latter is determined by the oversubscription factor of a kernel, i.e. the number of threads divided by the number of threads the GPU can execute simultaneously. Therefore, in Fig. 2a we relate the number of snapshots we can possibly take to the oversubscription factor. For this, we set \(\delta _t\) to a very small value of 1 ms. We can observe two things. First, with more kernel oversubscription, more snapshots are possible. Secondly, the number of snapshots is noticeably larger than the oversubscription factor. This becomes possible due to asynchronous kernel launches: as kernel launches are queued, the actual launch on the GPU is delayed, and it can happen that the interrupt arrives before the kernel has managed to execute any blocks. This means that in the asynchronous case, kernel interrupts may happen even in undersubscribed cases.

In Fig. 2b we investigate the snapshot increase when running our application with a realistic snapshotting interval of 4 min. We observe a similar pattern: the larger the data set we use, the more threads we allocate per GPU kernel, and therefore the number of snapshots we take increases as expected. The graph shows the average increase over 10 runs of the same application.

Fig. 3. Overhead analysis

Overhead Analysis. Our third experiment examines the overhead that comes with our extension. As it turns out, measuring the overhead is tricky because it depends very much on the value chosen for \(\delta _t\). As Fig. 3a shows, if \(\delta _t\) is very small, the overhead is quite noticeable. This is due to the MPI synchronisation that must occur for the call to FTI_Snapshot at each interrupt. However, Fig. 3b shows that this overhead is significantly reduced as \(\delta _t\) increases. The remaining observable overhead is attributed to each kernel thread always having to first check for the host's permission to continue.

7 Related Work

GPU Proxies. CRUM [4] achieves transparent checkpoint/restart (CR) by using a proxy process to decouple the application process's state from the device driver state. This allows checkpoints to be made without recording any active driver state. CRUM is geared toward applications with large memory footprints which make use of Unified Virtual Memory (UVM). The proxy process creates a shadow UVM region for each allocation made by the application process and then makes a corresponding real allocation via the CUDA driver. This setup is necessary because UVM has no API calls that can be intercepted. However, the restart process is based on the assumption of deterministic memory allocations made by the CUDA driver libraries, which is not guaranteed by CUDA. It also raises the question of what happens if a restart needs to occur on a different device: while the allocations may be deterministic, that does not mean they are consistent across devices. CRCUDA [17] and CheCL [19] are proxy-based approaches that target CUDA and OpenCL respectively. Like CRUM, CRCUDA is transparent to the application process. Unlike CRUM, CRCUDA does not rely on deterministic memory allocations. Instead, it logs and replays CUDA API calls, while BLCR [7] is responsible for saving and restoring the application's state. CRCUDA does not support MPI or applications that make use of UVM. CheCL provides its own OpenCL library to intercept and redirect API calls, decoupling the process from the OpenCL runtime.

GPU Virtualisation. A lot of related work is based on GPU virtualisation, such as [3, 5, 6, 13]. Virtual Machines (VMs) are attractive as they inherently serve as a buffer between the application and the physical device. This decoupling from the hardware makes checkpointing easier, especially in the realm of CUDA where the GPU context cannot be checkpointed along with the application. VMGL [10] is marketed as a cross-platform, GPU-independent OpenGL virtualisation solution with suspend and resume capabilities. Suspend and resume is enabled through a shadow driver which keeps track of OpenGL's context state. While OpenGL is supported by all GPU vendors, in practice it is used chiefly for rendering and is not well suited for general-purpose GPU computing. vCUDA [16] follows the same approach as VMGL, using CUDA instead of OpenGL. Unfortunately, VMs typically add overhead via extra communication, ultimately degrading performance.

Application Specific. CheCUDA [18] and NVCR [11] are now obsolete CUDA-based libraries because they depend on the CUDA context detaching cleanly before a checkpoint; recent versions of CUDA no longer have this property, so correct restarts cannot be guaranteed. CudaCR [15] is a CR library that is capable of capturing and rolling back the state within kernels in the event of a soft error. Similarly targeting soft errors, VOCL-FT [14] offers resilience against silent data corruption for OpenCL-accelerated applications. VOCL-FT is a library that virtualises the layer between the application and the accelerators in order to log the commands issued to OpenCL so that they may be replayed later in case of failure. HiAL-Ckpt [20] is a checkpointing tool for the Brook+ language with directives to indicate where checkpoints should be made; however, development on Brook+ seems to have stopped with the last official release in 2004. HeteroCheckpoint [8] is a CUDA library that mainly focuses on making checkpoints efficient by optimising the transfer of device data.

8 Conclusions and Future Work

This paper demonstrates a system that makes it possible to checkpoint and restore MPI applications with long-running GPU kernels. Its distinctive feature is the ability to take snapshots without the necessity to wait for kernel completion. To our knowledge, none of the existing resilience tools can do this automatically.

The system is based on the FTI library, one of the standard resilience tools, and extends it with the kernel interruption mechanism that we have described in [1]. As a result, the proposed tool significantly reduces the minimal interval at which snapshots can be taken, making it possible to align the snapshot frequency with the MTBF of the system of interest. We apply our system to the real-world numerical MPI+CUDA application LULESH. We verify that the proposed system is operational by running a number of snapshots and restores that include GPU data, and we demonstrate that the minimal snapshotting interval actually decreases.

Despite our system being fully operational and production-ready, it comes with a few limitations that immediately guide our future work. Currently, we do not verify that the automatic kernel interruption mechanism is safe, assuming that this is the job of the programmer. For example, if a kernel uses explicit inter-block synchronisation, our mechanism may introduce a deadlock. This is less of a problem for CUDA versions prior to 9, as inter-block synchronisation was not supported and the use of manual spinlocks is not advised by the manual. The latest CUDA versions allow such synchronisation, which we would like to detect by means of analysing CUDA kernels. Work is also required to identify and reduce the overhead observed from our extension.

Currently, the time we have to wait to interrupt a running kernel is equal to the time it takes to execute one thread of the kernel. If this time happens to be too large, we need to make our interruption mechanism smarter: we could check for interrupts not only at the beginning of each block, but also while a thread is running. This would require a more sophisticated analysis of kernels that takes data flow and control flow into account.