2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 01:51:51 +00:00
Radostin Stoyanov 59b022db35 cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
2023-10-22 13:29:25 -07:00
2024-10-26 22:18:22 -07:00
2023-10-22 13:29:25 -07:00
2023-10-22 13:29:25 -07:00
2023-04-15 21:17:21 -07:00
2021-09-03 10:31:00 -07:00
2024-09-11 16:02:11 -07:00
2016-08-11 16:18:43 +03:00
2012-07-30 13:52:37 +04:00
2024-10-26 22:18:22 -07:00
2024-09-21 16:34:14 -07:00
2024-09-11 16:02:11 -07:00

X86_64 GCC Test Docker Test Podman Test CircleCI

CRIU -- A project to implement checkpoint/restore functionality for Linux

CRIU (stands for Checkpoint and Restore in Userspace) is a utility to checkpoint/restore Linux tasks.

Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space. There are some more projects doing C/R for Linux, and so far CRIU appears to be the most feature-rich and up-to-date with the kernel.

CRIU project is (almost) the never-ending story, because we have to always keep up with the Linux kernel supporting checkpoint and restore for all the features it provides. Thus we're looking for contributors of all kinds -- feedback, bug reports, testing, coding, writing, etc. Please refer to CONTRIBUTING.md if you would like to get involved.

The project started as the way to do live migration for OpenVZ Linux containers, but later grew to more sophisticated and flexible tool. It is currently used by (integrated into) OpenVZ, LXC/LXD, Docker, and other software, project gets tremendous help from the community, and its packages are included into many Linux distributions.

The project home is at http://criu.org. This wiki contains all the knowledge base for CRIU we have. Pages worth starting with are:

Checkpoint and restore of simple loop process

Advanced features

As main usage for CRIU is live migration, there's a library for it called P.Haul. Also the project exposes two cool core features as standalone libraries. These are libcompel for parasite code injection and libsoccr for TCP connections checkpoint-restore.

Live migration

True live migration using CRIU is possible, but doing all the steps by hands might be complicated. The phaul sub-project provides a Go library that encapsulates most of the complexity. This library and the Go bindings for CRIU are stored in the go-criu repository.

Parasite code injection

In order to get state of the running process CRIU needs to make this process execute some code, that would fetch the required information. To make this happen without killing the application itself, CRIU uses the parasite code injection technique, which is also available as a standalone library called libcompel.

TCP sockets checkpoint-restore

One of the CRIU features is the ability to save and restore state of a TCP socket without breaking the connection. This functionality is considered to be useful by itself, and we have it available as the libsoccr library.

Licence

The project is licensed under GPLv2 (though files sitting in the lib/ directory are LGPLv2.1).

All files in the images/ directory are licensed under the Expat license (so-called MIT). See the images/LICENSE file.

Description
No description provided
Readme 81 MiB
Languages
C 86%
Python 6.1%
Java 2.6%
Shell 2.6%
Makefile 2%
Other 0.7%