mir/criu

mirror of https://github.com/checkpoint-restore/criu synced 2025-08-28 12:57:57 +00:00

Author	SHA1	Message	Date
Radostin Stoyanov	f8f0e1df76	seize: enable support for frozen containers Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete. To reflect this new behavior, the following changes are applied: - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)` - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode` - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)` Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`. Fixes: #2508 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-11-11 19:33:31 -08:00
Radostin Stoyanov	adf2c5be96	images/inventory: add field for enabled plugins This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This makes it possible to disable unnecessary plugins during restore, show appropriate error messages if required CRIU plugins are missing, and migrate a process that does not use a GPU from a GPU-enabled system to a CPU-only environment. We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between an unset and a missing field: - When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be enabled during restore. - When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be disabled during restore. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-10-26 22:18:22 -07:00
Andrei Vagin	6918998897	plugin/cuda: disable CUDA plugin if /dev/nvidiactl isn't present The presence of /dev/nvidiactl indicates that the system has a compatible NVIDIA GPU driver installed and that the GPU is accessible to the operating system. Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-19 15:23:42 -07:00
Andrei Vagin	651df375bd	criu: Allow disabling freeze cgroups Some plugins (e.g., CUDA) may not function correctly when processes are frozen using cgroups. This change introduces a mechanism to disable the use of freeze cgroups during process seizing, even if explicitly requested via the --freeze-cgroup option. The CUDA plugin is updated to utilize this new mechanism to ensure compatibility. Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	b1b3c14b17	cuda: unlock on timeout error When attempting to checkpoint a container with CUDA processes, CRIU could fail with the following error: Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 1 Error (cuda_plugin.c:143): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call Error (cuda_plugin.c:384): cuda_plugin: PAUSE_DEVICES failed with In this situation, the target process is locked, but CRIU fails due to a timeout and exits with an error. We need to make sure that the target PID is unlocked in such a case. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-19 15:23:42 -07:00
Radostin Stoyanov	ad66c27a11	cuda: fix launch cuda-checkpoint When the cuda-checkpoint tool is not installed, execvp() is expected to fail and return -1. In this case, we need to call exit() to terminate the child process that was created earlier with fork(). Since CRIU can be used with applications that do not use CUDA, even when the CUDA plugin is installed, this patch also updates the log messages to show debug and warning (instead of error) when the cuda-checkpoint tool is not found in $PATH. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org> Signed-off-by: Andrei Vagin <avagin@google.com>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	fde0b7ac69	cuda: don't leak fds to cuda-checkpoint Leaking open file descriptors to third-party tools can lead to security risks. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Radostin Stoyanov	c42b58f4fb	plugin: enable multiple plugins for the same hook CRIU provides two plugins for checkpoint/restore of GPU applications: amdgpu and cuda. Both plugins use the `RESUME_DEVICES_LATE` hook to enable restore: CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, amdgpu_plugin_resume_devices_late) CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESUME_DEVICES_LATE, cuda_plugin_resume_devices_late) However, CRIU currently does not support running more than one plugin for the same hook. As a result, when both plugins are installed, the resume function for CUDA applications is not executed. To fix this, we need to make sure that both `plugin_resume_devices_late()` functions return `-ENOTSUP` when restore is not supported. Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2024-09-11 16:02:11 -07:00
Jesus Ramos	bf417dd050	criu/plugin: Add NVIDIA CUDA plugin Add support for the NVIDIA cuda-checkpoint utility. This requires an r555 or newer driver along with the cuda-checkpoint binary. Signed-off-by: Jesus Ramos <jeramos@nvidia.com>	2024-09-11 16:02:11 -07:00
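
The fix described in ad66c27a11 (terminating the forked child when execvp() fails) can be sketched roughly as follows. This is a minimal illustration and not the CUDA plugin's actual code: the helper name `run_tool` and the exit codes 127 and 126 are assumptions borrowed from shell convention. The sketch also uses `_exit()` instead of the `exit()` named in the commit message, a common choice in forked children because it skips the parent's atexit handlers and stdio flushing.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from CRIU): run an external tool and
 * return its exit status, or -1 if it could not be forked or waited for.
 */
static int run_tool(char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;

	if (pid == 0) {
		execvp(argv[0], argv);
		/*
		 * Only reached when execvp() failed. Without this call the
		 * child would keep running the caller's code as a second
		 * copy of the parent process.
		 */
		_exit(errno == ENOENT ? 127 : 126);
	}

	/* Retry waitpid() if a signal interrupts it. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, a caller could treat a return value of 127 as "tool not found in $PATH" and log it as a warning rather than an error, which is the behavior the commit message describes for systems without cuda-checkpoint.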

9 Commits
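
The `-ENOTSUP` convention from c42b58f4fb, where a plugin declines a hook so that another plugin registered for the same hook can run, might look roughly like the sketch below. Everything here is hypothetical and simplified: `resume_hook_t`, `run_resume_hooks`, and the two toy plugin functions are illustrative names, and CRIU's real hook dispatcher and hook signatures differ.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook signature; CRIU's real hook signatures differ. */
typedef int (*resume_hook_t)(int pid);

/* Toy plugin that never handles the task and declines with -ENOTSUP. */
static int toy_amdgpu_resume(int pid)
{
	(void)pid;
	return -ENOTSUP;
}

/* Toy plugin that handles every task successfully. */
static int toy_cuda_resume(int pid)
{
	(void)pid;
	return 0;
}

/*
 * Call each registered hook in order. A hook returning -ENOTSUP means
 * "not mine", so the next one is tried. Any other value (success or a
 * real error) ends the loop and is returned to the caller.
 */
static int run_resume_hooks(const resume_hook_t *hooks, size_t n, int pid)
{
	for (size_t i = 0; i < n; i++) {
		int ret = hooks[i](pid);

		if (ret == -ENOTSUP)
			continue;
		return ret;
	}

	/* No registered hook supported this task. */
	return -ENOTSUP;
}
```

Under this assumption, a plugin that returned 0 for a task it does not own would stop the loop early and hide the plugin that should have run, which matches the symptom the commit message describes when both plugins are installed.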