Checkpointing CUDA Applications with CRIU

Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.

Checkpointing overview

Transparent, per-process checkpointing offers a middle ground between virtual machine checkpointing and application-driven checkpointing. Per-process checkpointing can be used in combination with containers to checkpoint the state of a complex application, facilitating use cases such as the following:

Fault tolerance, with periodic checkpoints (see the sketch below)

Preemption of lower-priority work on a single node by checkpointing the preempted task

Cluster scheduling with migration

Figure 1. Types of checkpointing
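For example, the fault-tolerance case can be assembled from the two utilities described in the rest of this post. The following is a minimal sketch, not a production script: $PID and $INTERVAL are placeholders for the target process and checkpoint period, and CRIU's --leave-running option keeps the application executing after each snapshot (the other flags mirror the example later in this post).

while true; do
    sleep "$INTERVAL"
    dir=ckpt-$(date +%s)
    mkdir "$dir"                              # one image directory per checkpoint
    cuda-checkpoint --toggle --pid "$PID"     # suspend CUDA state
    criu dump --leave-running --shell-job --images-dir "$dir" --tree "$PID"
    cuda-checkpoint --toggle --pid "$PID"     # resume CUDA state
done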

CRIU

CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility for Linux, maintained outside of NVIDIA, which can checkpoint and restore process trees. 

CRIU exposes its functionality through a command line program called criu and operates by checkpointing and restoring every kernel mode resource associated with a process. These resources include:

Anonymous memory

Threads

Regular files

Sockets

Pipes between checkpointed processes

As the behavior of these resources is specified by Linux and is independent of the underlying hardware, CRIU knows how to checkpoint and restore them. 

In contrast, NVIDIA GPUs provide functionality beyond that of a standard Linux kernel, so CRIU is not able to manage them. cuda-checkpoint adds this capability and can be used with CRIU to checkpoint and restore a CUDA application.
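Putting the two utilities together, a full checkpoint and restore is a four-step sequence. The following is a minimal sketch, where $PID is the target process and $DIR is a directory for the checkpoint image; each step is demonstrated concretely in the example later in this post:

cuda-checkpoint --toggle --pid $PID                             # 1. suspend CUDA state
criu dump --shell-job --images-dir $DIR --tree $PID             # 2. checkpoint the process
criu restore --shell-job --restore-detached --images-dir $DIR   # 3. restore the process
cuda-checkpoint --toggle --pid $PID                             # 4. resume CUDA state

Because CRIU restores the process with its original PID, the same $PID works for the final resume step.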

cuda-checkpoint

cuda-checkpoint checkpoints and restores the CUDA state of a single Linux process. It supports display driver version 550 and higher and can be downloaded from the bin directory of the NVIDIA/cuda-checkpoint GitHub repo.

localhost$ cuda-checkpoint --help

CUDA checkpoint and restore utility.
Toggles the state of CUDA within a process between suspended and running.
Version 550.54.09. Copyright (C) 2024 NVIDIA Corporation. All rights reserved.

--toggle --pid <value>
Toggle the state of CUDA in the specified process.

--help
Print help message.

The cuda-checkpoint binary can toggle the CUDA state of a process, specified by PID, between suspended and running.  A running-to-suspended transition is called a suspend and the opposite transition is called a resume.
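Because --toggle flips whichever state the process is currently in, the same command performs both transitions:

localhost# cuda-checkpoint --toggle --pid $PID    # running -> suspended (suspend)
localhost# cuda-checkpoint --toggle --pid $PID    # suspended -> running (resume)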

A process’s CUDA state is initially running. When cuda-checkpoint is used to suspend CUDA in a process, it follows these steps:

1. Any CUDA driver APIs that launch work, manage resources, or otherwise impact GPU state are locked.

2. Already-submitted CUDA work, including stream callbacks, is completed.

3. Device memory is copied to the host, into allocations managed by the CUDA driver.

4. All CUDA GPU resources are released.

Table 1. cuda-checkpoint used to suspend CUDA

cuda-checkpoint does not suspend CPU threads, which may continue to safely interact with CUDA in one of the following ways:

Calling runtime or driver APIs, which may block until CUDA is resumed

Accessing host memory allocated by cudaMallocHost and similar APIs, which remains valid
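As an illustration (a sketch, not code from the original example), the loop below remains safe if its CUDA state is suspended externally at any point: the kernel launch and cudaMemcpy block until the state is resumed, while reads of the cudaMallocHost buffer stay valid throughout.

#include <stdio.h>
#include <unistd.h>

__global__ void fill(int *out, int value)
{
    *out = value;
}

int main(void)
{
    int *devBuf, *hostBuf;
    cudaMalloc(&devBuf, sizeof *devBuf);
    cudaMallocHost(&hostBuf, sizeof *hostBuf); // pinned host memory: stays valid while suspended

    for (int i = 0; ; i++) {
        // If cuda-checkpoint suspends this process's CUDA state here,
        // these calls simply block until the state is resumed.
        fill<<<1, 1>>>(devBuf, i);
        cudaMemcpy(hostBuf, devBuf, sizeof *hostBuf, cudaMemcpyDeviceToHost);

        printf("%d\n", *hostBuf);              // accessing pinned memory is always safe
        sleep(1);
    }
    return 0;
}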

A suspended CUDA process no longer directly refers to any GPU hardware at the OS level and may therefore be checkpointed by a CPU checkpointing utility such as CRIU.

When a process’s CUDA state is resumed using cuda-checkpoint, it follows these steps:

1. GPUs are re-acquired by the process.

2. Device memory is copied back to the GPU, and GPU memory mappings are restored at their original addresses.

3. CUDA objects, such as streams and contexts, are restored.

4. CUDA driver APIs are unlocked.

Table 2. CUDA state is resumed using cuda-checkpoint

At this point, CUDA calls unblock and CUDA can begin running on the GPU again.

Checkpointing example

This example uses cuda-checkpoint and CRIU to checkpoint a CUDA application called counter. Every time counter receives a packet, it increments a value in GPU memory and replies with the updated value. The example code is also available in the GitHub repo.

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 10000

__device__ int counter = 100;
__global__ void increment()
{
    counter++;
}

int main(void)
{
    cudaFree(0); // force CUDA context creation

    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    sockaddr_in addr = {AF_INET, htons(PORT), inet_addr("127.0.0.1")};
    bind(sock, (sockaddr *)&addr, sizeof addr);

    while (true) {
        char buffer[16] = {0};
        sockaddr_in peer = {0};
        socklen_t inetSize = sizeof peer;
        int hCounter = 0;

        // wait for a packet
        recvfrom(sock, buffer, sizeof buffer, 0, (sockaddr *)&peer, &inetSize);

        // increment the counter in GPU memory and read back its value
        increment<<<1,1>>>();
        cudaMemcpyFromSymbol(&hCounter, counter, sizeof counter);

        // reply with the updated value
        size_t bytes = sprintf(buffer, "%d\n", hCounter);
        sendto(sock, buffer, bytes, 0, (sockaddr *)&peer, inetSize);
    }
    return 0;
}

You can build the counter application using nvcc.

localhost$ nvcc counter.cu -o counter

Launch counter in the background and save its PID for reference in subsequent commands:

localhost# ./counter &
localhost# PID=$!

Send counter a packet and observe the returned value. The initial value was 100 but the response is 101, showing that the GPU memory has changed since initialization.

localhost# echo hello | nc -u localhost 10000 -W 1
101

Use nvidia-smi to confirm that counter is running on a GPU:

localhost# nvidia-smi --query --display=PIDS | grep $PID
Process ID : 298027

Use cuda-checkpoint to suspend the counter CUDA state:

localhost# cuda-checkpoint --toggle --pid $PID

Use nvidia-smi to confirm that counter is no longer running on a GPU:

localhost# nvidia-smi --query --display=PIDS | grep $PID

Create a directory to hold the checkpoint image:

localhost# mkdir demo

Use criu to checkpoint counter:

localhost# criu dump --shell-job --images-dir demo --tree $PID
[1]+ Killed ./counter

Confirm that counter is no longer running:

localhost# ps --pid $PID
PID TTY TIME CMD

Use criu to restore counter:

localhost# criu restore --shell-job --restore-detached --images-dir demo

Use cuda-checkpoint to resume the counter CUDA state:

localhost# cuda-checkpoint --toggle --pid $PID

Now that counter is fully restored, send it another packet. The response is 102, showing that the GPU memory updated before the checkpoint was preserved correctly.

localhost# echo hello | nc -u localhost 10000 -W 1
102

Functionality

As of display driver version 550, checkpoint and restore functionality is still being actively developed. In particular, cuda-checkpoint has the following characteristics:

x64 only.

Acts upon a single process, not a process tree.

Doesn’t support UVM or IPC memory.

Doesn’t support GPU migration.

Waits for already-submitted CUDA work to finish before completing a checkpoint.

Doesn’t attempt to keep the process in a good state if an error (such as the presence of a UVM allocation) is encountered during checkpoint or restore.

These limitations will be addressed in subsequent display driver releases and will not require an update to the cuda-checkpoint utility itself, because the utility only exposes functionality that is implemented in the driver.

Summary

The cuda-checkpoint utility, when combined with CRIU, enables transparent per-process checkpointing of Linux applications. For more information, see the /NVIDIA/cuda-checkpoint GitHub repo.

Try checkpointing the counter application, or any other compatible CUDA application, on your own machine!