2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-23 02:17:22 +00:00

16 Commits

Author SHA1 Message Date
Radostin Stoyanov
bc3e4e880a cuda: remove redundant goto label
The `goto interrupt` label is unnecessary as the code directly
returns after `cuda_process_checkpoint_action()`.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-04-12 18:17:44 -07:00
Radostin Stoyanov
3284a83529 cuda: use pr_perror for libc function errors
When handing errors for functions such as `ptrace()`, `pipe()`, and
`fork()` it would be better to use `pr_perror` instead of `pr_err`
as it would include a message describing the encountered error.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-04-12 18:17:44 -07:00
Radostin Stoyanov
fc1dbc4915 cuda: disable CUDA plugin for pre-dump
Temporarily disable CUDA plugin for `criu pre-dump`.

pre-dump currently fails with the following error:

Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0                                  [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895                              /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1

We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
Radostin Stoyanov
59b022db35 cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
Jesus Ramos
6d1da61482 cuda: Fix return value from CHECKPOINT_DEVICES hook so that dump's fail properly
cuda-checkpoint returns the positive CUDA error code when it runs into an issue
and passing that along as the return value would cause errors to get ignored

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-12-10 14:19:05 -08:00
Radostin Stoyanov
d6e5e7677f cuda: enable checkpoint support for paused tasks
If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.

This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.

Another use case for this functionality is optimizing resource
utilization, where CUDA tasks with low-priority are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-13 07:07:54 -08:00
Radostin Stoyanov
26dcc216c2 cuda: fix check for GPU device availability
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.

The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: #2509

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-11 19:35:01 -08:00
Radostin Stoyanov
f8f0e1df76 seize: enable support for frozen containers
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.

However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads.  To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:
 - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
 - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
 - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.

Fixes: #2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-11 19:33:31 -08:00
Radostin Stoyanov
adf2c5be96 images/inventory: add field for enabled plugins
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.

We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:

- When the field is missing, it indicates that the checkpoint was
  created with a previous version of CRIU, and all plugins should be
  *enabled* during restore.

- When the field is empty, it indicates that no plugins were used during
  checkpointing. Thus, all plugins can be *disabled* during restore.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-10-26 22:18:22 -07:00
Andrei Vagin
6918998897 plugin/cuda: disable CUDA plugin if /dev/nvidiactl isn't present
The presence of /dev/nvidiactl indicates that the system has a
compatible NVIDIA GPU driver installed and that the GPU is accessible to
the operating system.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
Andrei Vagin
651df375bd criu: Allow disabling freeze cgroups
Some plugins (e.g., CUDA) may not function correctly when processes are
frozen using cgroups. This change introduces a mechanism to disable the
use of freeze cgroups during process seizing, even if explicitly
requested via the --freeze-cgroup option.

The CUDA plugin is updated to utilize this new mechanism to ensure
compatibility.

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
b1b3c14b17 cuda: unlock on timeout error
When attempting to checkpoint a container with CUDA processes,
CRIU could fail with the following error:

	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1
	Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with

In this situation, the target process is locked, but CRIU fails due to
a timeout and exits with an error. We need to make sure that the target
PID is unlocked in such case.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-19 15:23:42 -07:00
Radostin Stoyanov
ad66c27a11 cuda: fix launch cuda-checkpoint
When the cuda-checkpoint tool is not installed, execvp() is expected to
fail and return -1. In this case, we need to call exit() to terminate
the child process that was created earlier with fork().

Since CRIU can be used with applications that do not use CUDA, even
when the CUDA plugin is installed, this patch also updates the log
messages to show debug and warning (instead of error) when the
cuda-checkpoint tool is not found in $PATH.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
fde0b7ac69 cuda: don't leak fds to cuda-checkpoint
Leaking open file descriptors to third-party tools can lead
to security risks.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Radostin Stoyanov
c42b58f4fb plugin: enable multiple plugins for the same hook
CRIU provides two plugins for checkpoint/restore of GPU applications:
amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to
enable restore:

    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late)
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late)

However, CRIU currently does not support running more than one plugin
for the same hook. As a result, when both plugins are installed, the
resume function for CUDA applications is not executed. To fix this,
we need to make sure that both `plugin_resume_devices_late()` functions
return `-ENOTSUP` when restore is not supported.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-09-11 16:02:11 -07:00
Jesus Ramos
bf417dd050 criu/plugin: Add NVIDIA CUDA plugin
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an
r555 or higher driver along with the cuda-checkpoint binary.

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-09-11 16:02:11 -07:00