The restore of a task with shadow stack enabled adds these steps:
* switch from the default shadow stack to a temporary shadow stack
allocated in the premmaped area
* unmap CRIU mappings; nothing changed here, but it's important that
CRIU mappings can be removed only after switching to a temporary
shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
There are several gotachs when restoring a task with shadow stack:
* depending on the compiler options, glibc version and glibc tunables
CRIU can run with or without shadow stack.
* shadow stack VMAs are special, they must be created using a dedicated
map_shadow_stack() system call and can be modified only by a special
instruction (wrss) that is only available when shadow stack is
enabled.
* once shadow stack is enabled, it is not writable even with wrss;
writes to shadow stack can be only enabled with ptrace() and only when
shadow stack is enabled in the tracee.
* if the shadow stack is enabled during restore rather than by glibc,
calling retq after arch_prctl() that enables the shadow stack causes
#CP, so the function that enables shadow stack can never return.
Add the infrastructure required to cope with all of those:
* modify the restore code to allow trampoline (arch_shstk_trampoline)
that will enable shadow stack and call restore_task_with_children().
* add call to arch_shstk_unlock() right after the tasks are clone()ed;
this will allow unlocking shadow stack features and making shadow
stack writable.
* add stubs for architectures that do not support shadow stacks
* add implementation of arch_shstk_trampoline() and arch_shstk_unlock()
for x86, but keep it disabled; it will be enabled along with addtion
of the code that will restore shadow stack in the restorer blob
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Detect if CRIU runs with shadow stack enabled and store the result in
kerndat.
Unlike most kerndat knobs, kdat_has_shstk() does not check for
availability of the shadow stack in the kernel, but rather checks if
criu runs with shadow stack enabled.
This depends on hardware availabilty, kernel and glibc support, compiler
options and glibc tunables, so kdat_has_shstk() must be called every
time CRIU starts and its result cannot be cached.
The result will be used by the code that controls shadow stack
enablement in the next commit.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Shadow stacks must be populated using special WRSS instruction. This
instruction is only available when shadow stack is enabled, calling it
with disabled shadow stack causes #UD.
Moreover, shadow stack VMAs cannot be mremap()ed and they must be
created using map_shadow_stack() system call. This requires delaying the
restore of shadow stacks to restorer blob after the CRIU mappings are
cleared.
Introduce rst_shstk_info structure to hold shadow stack parameters
required in the restorer blob and populate this structure in
arch_prepare_shstk() method.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Shadow stack VMAs cannot be mmap()ed, they must be created using
map_shadow_stack() system call and populated using special wrss
instruction available only when shadow stack is enabled.
Premap them to reserve virtual address space and populate it to have
there contents available for later copying after enabling shadow stack.
Along with the space required by shadow stack VMAs also reserve an extra
page that will be later used as a temporary shadow stack.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
The shadow stack VMAs require special care because they can only be
created and populated using special system calls.
Add VMA_AREA_SHSTK flag and set it for VMAs that are marked as "ss" in
/proc/pid/smaps
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.
In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.
For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.
In non-daemon mode parasite_trap_cmd() returns non-positive value
which allows to distinguish daemon and non-daemon mode and properly stop
at int3 in non-daemon mode.
Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
All architectures create on-stack structure for floating point save area
in compel_get_task_regs() if the caller passes NULL rather than a valid
pointer.
The only place that calls compel_get_task_regs() with NULL for floating
point save area is parasite_start_daemon() and it is simpler to define
this strucuture on stack of parasite_start_daemon().
The availability of floating point save data is required in
parasite_start_daemon() to detect shadow stack presence early during
parasite infection and will be used in later patches.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Currently we have checkpoint/restore support only of cgroup v2 threaded
controllers. Threads originating in cgroup v1 environments will be
restored to the main thread's cgroup. This change extends the support
for a cgroups v1.
Signed-off-by: Stepan Pieshkin <stepanpieshkin@google.com>
This patch fixes the following lint error:
scripts/criu-ns:219:16: E713 [*] Test for membership should be `not in`
The change in this patch is auto-generated with `ruff --fix`.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Ruff (https://github.com/astral-sh/ruff) is a Python linter
written in Rust, designed to replace Flake8. It is significantly
faster and actively maintained.
In addition to replacing flake8 with ruff, this patch also
creates separate makefile targets for ruff, shellcheck and
codespell, so that they can be tested independently.
RUFF_FLAGS can be used to specify options such as '--fix'.
Example:
make lint
make ruff RUFF_FLAGS=--fix
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch fixes the following flake8 error:
python3 -m flake8 --config=scripts/flake8.cfg lib/pycriu/images/pb2dict.py
lib/pycriu/images/pb2dict.py:361:43: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The commit introducing PAGE_IS_SOFT_DIRTY has not been merged
in kernel v6.7.x.
fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
https://github.com/torvalds/linux/commit/e6a9a2cbc13bf
As a result, CRIU fails with the following error:
Error (criu/pagemap-cache.c:199): pagemap-cache: PAGEMAP_SCAN: Invalid argument'
Error (criu/pagemap-cache.c:225): pagemap-cache: Failed to fill cache for 63 (400000-402000)'
This patch updates check_pagemap() in kerndat to check if PAGE_IS_SOFT_DIRTY is supported.
Fixes: #2334
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Refactor code used to Checkpoint DRM devices. Code is moved
into amdgpu_plugin_drm.c file which hosts various methods to
checkpoint and restore a workload.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Add a new compilation unit to host symbols and methods that will be
needed to C&R DRM devices. Refactor code that indicates support for
C&R and checkpoints KFD and DRM devices
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
We already don't treat it as error in the plugin itself, but after
returning -1 from RESUME_DEVICES_LATE hook we print debug message in
criu about failed plugin, let's return 0 instead.
While on it let's replace ret to exit_code.
Fixes: a9cbdad76 ("plugin/amdgpu: Don't print error for "No such process" during resume")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During the late stages of restore, each process being resumed gets
an ioctl call to KFD_CRIU_OP_RESUME. If the process has no kfd
process info, this call with fail with -ESRCH. This is normal
behaviour, so we shouldn't print an error message for it.
Signed-off-by: David Francis <David.Francis@amd.com>
To improve readability, this patch changes the return type of
iptables_has_criu_jump_target() to a boolean, where 'true' indicates
that iptables has CRIU jump target and 'false' indicates otherwise.
Suggested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch removes a leftover declaration for log_closedir()
which has been removed in the following commit:
dc80d6f125e1e919363a0b8f938b8679ff0dbc2b
log: get rid of LOG_DIR_FD_OFF and opening cwd in log_init()
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Let's use hooked nft chain which actually affects packets.
Fixes: e5f4d8c6f ("test/nfconntrack: use nft or iptables-legacy")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This change adds a new injectable fault (135) to disable PAGEMAP_SCAN and fault
back to read pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
PAGEMAP_SCAN is a new ioctl that allows to get page attributes in a more
effeciant way than reading pagemap files.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
In commit [1] was introduced a mechanism to auto-generate the files:
sys-exec-tbl*.c, syscalls*.S, syscall-codes*.h, and syscall*.h.
This commit also updated the gitignore rules to ignore auto-generated
files. However, after commit [2], the path for these files has changed
and the patterns specified in gitignore are no longer needed.
[1] bbc2f133 (x86/build: generate syscalls-{64,32}.built-in.o)
[2] 19fadee9 (compel: plugins,std -- Implement syscalls in std plugin)
Reported-by: @felicitia
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The image has a too old version of nettle which does not work with gnutls.
Just upgrade to the latest to make the error go away.
Signed-off-by: Adrian Reber <areber@redhat.com>
Newer versions of 'tail' rely on inotify and after a restore 'tail' is
unhappy with the state of inotify and just stops.
This replaces 'tail' with a minimal shell based test (thanks Andrei).
Signed-off-by: Adrian Reber <areber@redhat.com>
If ioctl(TIOCSLCKTRMIOS) fails with EPERM it means that a CRIU
process lacks of CAP_SYS_ADMIN capability. But we can use
ioctl(TIOCGLCKTRMIOS) to *read* current ->termios_locked
value from the kernel and if it's the same as we already have
we can skip failing ioctl(TIOCSLCKTRMIOS) safely.
Adrian has recently posted [1] a very good patch to allow ioctl(TIOCSLCKTRMIOS)
for processes that have CAP_CHECKPOINT_RESTORE (right now it requires CAP_SYS_ADMIN).
[1] https://lore.kernel.org/all/20231206134340.7093-1-areber@redhat.com/
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
WARNINGS variable should be amended, not redefined.
We still need, e.g., `-Wno-dangling-pointer` to build
criu on loongarch64 with gcc13.
Signed-off-by: Ivan A. Melnikov <iv@altlinux.org>
Checkpoint/restore with version 25.0.0-beta.1 fails
with the following error:
$ docker start --checkpoint=c1 cr
Error response from daemon: failed to create task for container: content digest fdb1054b00a8c07f08574ce52198c5501d1f552b6a5fb46105c688c70a9acb45: not found: unknown
Release notes:
https://github.com/moby/moby/discussions/46816
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
In the compel/arch/arm/plugins/std/syscalls/syscall.def, the syscall number of bind on ARM64 should be 200 instead of 235
Signed-off-by: Sally Kang <snapekang@gmail.com>
Two major highlights of this release:
* LoongArch64 support
* A lot of fixes and improvments form the Google backlog.
The full changelog can be found here: https://criu.org/Download/criu/3.19.
This marks the final release of the 3.x series. The upcoming version
will be 4.0! Additionally, the naming pattern will be changed. Any ideas
are welcome.
Signed-off-by: Andrei Vagin <avagin@gmail.com>