This patch implements the entire logic to enable the offloading of
buffer object content restoration.
The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.
It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.
This optimization only focuses on the single-process situation (common
case). In other scenarios, it will turn to the original method. This is
achieved with the new `parallel_disabled` flag.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently the restore of buffer object comsumes a significant amount of
time. However, this part has no logical dependencies with other restore
operations. This patch introduce some structures and some helper
functions for the target process to offload this task to the main CRIU
process.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
When enabling parallel restore, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain TCP socket and stores this
socket in `fdstore`.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know if there is only one
process to restore. This patch adds a `has_children` function in
`pstree.h` and replaces some existing implementations with this
function.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not
initialized. However, during the plugin restore procedure, there may be
some common file operations used in multiple hooks. This patch moves
`cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init` to use
`fdstore` to place these file operations.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, in the target process, device-related restore operations and
other restore operations almost run sequentially. When the target
process executes the corresponding CRIU hook functions, it can't perform
other restore operations. However, for GPU applications, some device
restore operations have no logical dependencies on other common restore
operations and can be parallelized with other operations to speed up the
process.
Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.
- POST_FORKING
*POST_FORKING: Hook to enable the main CRIU process to perform some
restore operations of plugins.
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Building CRIU on Ubuntu 20.04 fails with the following error:
criu/sk-inet.c: In function 'can_dump_ipproto':
criu/sk-inet.c:131:16: error: 'IPPROTO_MPTCP' undeclared (first use in this function); did you mean 'IPPROTO_MTP'?
131 | if (proto == IPPROTO_MPTCP)
| ^~~~~~~~~~~~~
| IPPROTO_MTP
Add definition for MPTCP to fix this error.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The container checkpointing procedure in Kubernetes freezes running
containers to create a consistent snapshot of both the runtime state
and the rootfs of the container. However, when checkpointing a GPU
container, the container must be unfrozen before invoking the
cuda-checkpoint tool.
This is achieved in prepare_freezer_for_interrupt_only_mode(), which
needs to be called before the PAUSE_DEVICES hook. The patch introducing
this functionality fixes this problem for containers with multiple
processes. However, if the container has a single process,
prepare_freezer_for_interrupt_only_mode() must be invoked immediately
before the PAUSE_DEVICES hook.
Fixes: #2514
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In 0a7c5fd1bd8d1e49e273b51ff39af473d6c68cbc we swapped the BSD
implementation of strlcat and strlcpy in favor of our own replacement.
The checks and the predefined macros are not needed anymore.
Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
In some cases, they might not work in virtual machines if the hypervisor
doesn't virtualize them. For example, they don't work in AMD SEV virtual
machines if the Debug Virtualization extension isn't supported or isn't
enabled in SEV_FEATURES.
Fixes#2658
Signed-off-by: Andrei Vagin <avagin@gmail.com>
With Go version 1.24, ListenConfig now uses MPTCP by default [1].
Checkpoint/restore for this protocol is not currently supported
and adding support requires kernel changes that are not trivial
to implement. As a result, checkpointing of many containers that
run Go programs is likely to fail with the following error [2]:
(00.026522) Error (criu/sk-inet.c:130): inet: Unsupported proto 262 for socket 2f9bc5
This patch adds a message with suggested workaround for this problem.
[1] https://go.dev/doc/go1.24#netpkgnet
[2] https://github.com/checkpoint-restore/criu/issues/2655
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
It makes root mount readonly and checks that it is still readonly after
migration.
Make zdtm/static writable for logs via "bind" desc option.
v2: explain why we don't have explicit rw/ro flag check
v3: use new zdtm "bind" desc option
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Add {'bind': 'path/to/bindmount'} zdtm descriptor option, so that in
test mount namespace a directory bindmount can be created before running
the test.
This is useful to leave test directory writable (e.g. for logs) while
the test makes root mount readonly. note: We create this bindmount early
so that all test files are opened on it initially and not on the below
mount. Will be used in mnt_ro_root test.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Mount flags belong to mount and mount namespace of the Container, so we
should preserve them, as Container user will not expect mounts switching
between ro and rw over c/r.
Fixes: #2632
v5: fix both mount-v1 and mount-v2
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Building CRIU package on Debian 11 aarch64 fails with
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:32:31: error: storage size of 'paca' isn't known
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c:33:31: error: storage size of 'pacg' isn't known
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:47:15: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (hwcaps & HWCAP_PACA) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:47:15: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c:53:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:82:15: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (hwcaps & HWCAP_PACG) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:88:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:33:31: error: unused variable 'pacg' [-Werror=unused-variable]
struct user_pac_generic_keys pacg;
^~~~
criu/arch/aarch64/crtools.c:32:31: error: unused variable 'paca' [-Werror=unused-variable]
struct user_pac_address_keys paca;
^~~~
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:227:31: error: storage size of 'upaca' isn't known
struct user_pac_address_keys upaca;
^~~~~
criu/arch/aarch64/crtools.c:228:31: error: storage size of 'upacg' isn't known
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:241:18: error: 'HWCAP_PACA' undeclared (first use in this function); did you mean 'HWCAP_FCMA'?
if (!(hwcaps & HWCAP_PACA)) {
^~~~~~~~~~
HWCAP_FCMA
criu/arch/aarch64/crtools.c:255:44: error: 'NT_ARM_PACA_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACA_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function)
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
criu/arch/aarch64/crtools.c:268:18: error: 'HWCAP_PACG' undeclared (first use in this function); did you mean 'HWCAP_AES'?
if (!(hwcaps & HWCAP_PACG)) {
^~~~~~~~~~
HWCAP_AES
criu/arch/aarch64/crtools.c:275:44: error: 'NT_ARM_PACG_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_SVE'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PACG_KEYS, &iov))) {
^~~~~~~~~~~~~~~~
NT_ARM_SVE
criu/arch/aarch64/crtools.c:233:6: error: variable 'ret' set but not used [-Werror=unused-but-set-variable]
int ret;
^~~
criu/arch/aarch64/crtools.c:228:31: error: unused variable 'upacg' [-Werror=unused-variable]
struct user_pac_generic_keys upacg;
^~~~~
criu/arch/aarch64/crtools.c:227:31: error: unused variable 'upaca' [-Werror=unused-variable]
struct user_pac_address_keys upaca;
^~~~~
This patch adds the missing constants and structs if undefined.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
CRIU locks the network during restore in an "empty" network namespace.
However, "empty" in this context means CRIU isn't restoring the
namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.
Fixes#2650
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Currently we save FP regs before parasite code runs, and restore after
for --leave-running, --check-only, and in case of errors. In case of
errors the error may have happened before FP regs were saved, so we
should only restore them if they were actually saved.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
On a RHEL 8 based system building CRIU fails with:
criu/arch/aarch64/crtools.c: In function 'save_pac_keys':
criu/arch/aarch64/crtools.c:73:39: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
ret = ptrace(PTRACE_GETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov);
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
criu/arch/aarch64/crtools.c:73:39: note: each undeclared identifier is reported only once for each function it appears in
criu/arch/aarch64/crtools.c: In function 'arch_ptrace_restore':
criu/arch/aarch64/crtools.c:261:44: error: 'NT_ARM_PAC_ENABLED_KEYS' undeclared (first use in this function); did you mean 'NT_ARM_PACA_KEYS'?
if ((ret = ptrace(PTRACE_SETREGSET, pid, NT_ARM_PAC_ENABLED_KEYS, &iov))) {
^~~~~~~~~~~~~~~~~~~~~~~
NT_ARM_PACA_KEYS
This adds the missing define if it is undefined.
Signed-off-by: Adrian Reber <areber@redhat.com>
The `goto interrupt` label is unnecessary as the code directly
returns after `cuda_process_checkpoint_action()`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
When handing errors for functions such as `ptrace()`, `pipe()`, and
`fork()` it would be better to use `pr_perror` instead of `pr_err`
as it would include a message describing the encountered error.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Thomas Gleixner introduced the new interface to create posix timers
with specifed timer IDs:
ec2d0c0462
Previously, CRIU recreated timers by repeatedly creating and deleting
them until the desired ID was reached. This approach isn't fast,
especially for timers with large IDs. For example, restoring two timers
with IDs 1000000 and 2000000 took approximately 1.5 seconds.
The new `prctl()` based interface allows direct creation of timers with
specified IDs, reducing the restoration time to around 3 microseconds
for the same example.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.
This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
This is highly confusing, and it seems that the ret variable
is not handled in the subsequent process.
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
When using pr_err in signal handler, locking is used
in an unsafe manner. If another signal happens while holding the
lock, deadlock can happen.
To fix this, we can introduce mutex_trylock similar to
pthread_mutex_trylock that returns immediately. Due to the fact
that lock is used only for writing first_err, this change garantees
that deadlock cannot happen.
Fixes: #358
Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com>
free_userns_maps is called to clean up uid/gid map when the dump
finishes. If we try to clean up these maps in error cases, it can lead
to double free panic. So just skip cleaning up these maps and let
free_userns_maps do its job.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
There are a couple of tests that require the iptables binary.
Instead of adding a checkskip script, which could also handle this,
this change now uses CRIU's feature detection to see if the CRIU
feature 'has_ipt_legacy' exists.
Signed-off-by: Adrian Reber <areber@redhat.com>
If the tests in others/rpc are failing no information about that error
can be seen in a CI run. This change displays the log files if the test
fails.
Signed-off-by: Adrian Reber <areber@redhat.com>
The tests in others/rpc are running as non-root and
fail silently if the nftables network locking backend is used.
This switches those tests to skip the network locking.
Signed-off-by: Adrian Reber <areber@redhat.com>
The building section also contains the information how to change the
network locking backend without source code changes.
Signed-off-by: Adrian Reber <areber@redhat.com>
As different Linux distributions are switching away from iptables
to nftables, this makes it easier to compile CRIU with a different
default network locking backend. Instead of changing the source
code it is now possible to select the nft backend like this:
make NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES
Signed-off-by: Adrian Reber <areber@redhat.com>
Let's change the data types of `nbucket` and `nchain` to uint32.
This should fix the following compile-time error on arm32:
/criu/criu/pie/util-vdso.c:336: undefined reference to `__aeabi_uldivmod'
Signed-off-by: Andrei Vagin <avagin@google.com>
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All this properties have to be C/R-ed.
As they are per-process protperties, we can save/restore them just for
one thread.
Signed-off-by: Andrei Vagin <avagin@google.com>
Threads are put into cgroups through the cgroupd thread, which
communicates with other threads using a socketpair.
Previously, each thread received a dup'd copy of the socket, and did
the following
sendmsg(socket_dup_fd, my_cgroup_set);
// wait for ack.
while (1) {
recvmsg(socket_dup_fd, &h, MSG_PEEK);
if (h.pid != my_pid) continue;
recvmsg(socket_dup_fd, &h, 0);
}
close(socket_dup_fd);
When restoring many threads, many threads would be spinning in the
above loop waiting for their PID to appear.
In my test-case, restoring a process with a 11.5G heap and 491 threads
could take anywhere between 10 seconds and 60 seconds to complete.
To avoid the spinning, we drop the loop and MSG_PEEK, and add a lock
around the above code. This does not decrease parallelism, as the
cgroupd daemon uses a single thread anyway.
With the lock in place, the same restore consistently takes around 10
seconds on my machine (Thinkpad P14s, AMD Ryzen 8840HS).
There is a similar "daemon" thread for user namespaces. That already
is protected with a similar userns_sync_lock in __userns_call().
Fixes#2614
Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
* Hash buckets is an array of 32-bit words. While DT_HASH is 32-bit on
most platforms except s390 (where it's 64-bit).
* The bloom filter word size differs between 32-bit and 64-bit ELF
files. This commit adjusts the code to handle both cases.
Signed-off-by: Andrei Vagin <avagin@google.com>
Currently CRIU has the possibility to specify a LSM label during
restore. Unfortunately the information is completely ignored in the case
of SELinux.
This change selects the lsm label from the user if it is provided and
else the label from the checkpoint image is used.
Signed-off-by: Adrian Reber <areber@redhat.com>
With Python 3.13, the `subprocess` module now uses the
`posix_spawn()` function [1], which requires the `signal`
module to be imported.
Fixes: #2607
[1] https://docs.python.org/3/whatsnew/3.13.html#subprocess
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Add relevant elf header constants and notes for the arm platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
Add relevant elf header constants and notes for the aarch64 platform
to enable coredump generation.
Signed-off-by: समीर सिंह Sameer Singh <lumarzeli30@gmail.com>
strstartswith() function is incorrect choice for finding parent
directory so i change it to issubpath() function
Signed-off-by: Dmitrii Chervov <dschervov1@yandex.ru>
Currently Fedora rawhide based CI runs fail with:
/bin/sh: line 1: awk: command not found
Let's install it.
Signed-off-by: Adrian Reber <areber@redhat.com>
This way,
- Makefile is less cluttered;
- one can run codespell from the command line.
Fixes: fd7e97fcf ("lint: exclude tags file from codespell")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>