After the CRIU process saves the parasite code for the target thread in
the shared mmap, it is necessary to call __clear_cache before the target
thread executes the code.
Without this step, the target thread may not see the correct code to
execute, which can result in a SIGILL signal.
For the specific arm64 case. this is important so that the newly copied
code is flushed from d-cache to RAM, so that the target thread sees the
new code.
The change is based on commit 6be10a2 by @fu.lin and on input received
from @adrianreber.
[ avagin: tweak code comment ]
Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The stack test incorrectly assumed the page immediately
following the stack pointer could never be changed. This doesn't work,
because this page can be a part of another mapping.
This commit introduces a dedicated "stack redzone," a small guard region
directly after the stack. The stack test is modified to specifically
check for corruption within this redzone.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
PAC stands for Pointer Authentication Code. Each process has 5 PAC keys
and a mask of enabled keys. All this properties have to be C/R-ed.
As they are per-process protperties, we can save/restore them just for
one thread.
Signed-off-by: Andrei Vagin <avagin@google.com>
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.
libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).
Signed-off-by: Adrian Reber <areber@redhat.com>
We need to dynamically calculate TASK_SIZE depending
on the MMU on RISC-V system. [We are using analogical
approach on aarch64/ppc64le.]
This change was tested on physical machine:
StarFive VisionFive 2
isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
mmu : sv39
uarch : sifive,u74-mc
mvendorid : 0x489
marchid : 0x8000000000000007
mimpid : 0x4210427
hart isa : rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
We don't need to have compel/arch/riscv64/plugins/std/syscalls/syscalls.S
tracked in git. It is autogenerated. We also need to update our .gitignore
to ignore autogenerated files with syscall tables.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
1. os auto assignment vma addr maybe conflict with vma in gpu living migrate scene;
2. so, we should give choice to user;
Signed-off-by: haozi007 <liuhao27@huawei.com>
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:
compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
276 | if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
| ^~~~~~~~~~~~~~~~~
While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:
We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
that most definitions come from either one or another. Alas, on
Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
there is no way they can be used together. Let's rely on libc one.
Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.
Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.
Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
The restore of a task with shadow stack enabled adds these steps:
* switch from the default shadow stack to a temporary shadow stack
allocated in the premmaped area
* unmap CRIU mappings; nothing changed here, but it's important that
CRIU mappings can be removed only after switching to a temporary
shadow stack
* create shadow stack VMA with map_shadow_stack()
* restore shadow stack contents with wrss
* switch to "real" shadow stack
* lock shadow stack features
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
To support sigreturn with CET enabled parasite must rewind its stack
before calling sigreturn so that shadow stack will be compatible with
actual calling sequence.
In addition, calling sigreturn from top level routine
(__export_parasite_head_start) will significantly simplify the shadow
stack manipulations required to execute sigreturn.
For x86 make fini_sigreturn() return the stack pointer for the signal
frame that will be used by sigreturn and propagate that return value up
to __export_parasite_head_start.
In non-daemon mode parasite_trap_cmd() returns non-positive value
which allows to distinguish daemon and non-daemon mode and properly stop
at int3 in non-daemon mode.
Architectures other than x86 remain unchanged and will still call
sigreturn from fini_sigreturn().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
All architectures create on-stack structure for floating point save area
in compel_get_task_regs() if the caller passes NULL rather than a valid
pointer.
The only place that calls compel_get_task_regs() with NULL for floating
point save area is parasite_start_daemon() and it is simpler to define
this strucuture on stack of parasite_start_daemon().
The availability of floating point save data is required in
parasite_start_daemon() to detect shadow stack presence early during
parasite infection and will be used in later patches.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
In the compel/arch/arm/plugins/std/syscalls/syscall.def, the syscall number of bind on ARM64 should be 200 instead of 235
Signed-off-by: Sally Kang <snapekang@gmail.com>
Currently page_size() returns unsigned int value that is after "bitwise
not" is promoted to unsigned long value e.g. in uffd.c
handle_page_fault. Since the value is unsigned promotion is done with 0
MSB that results in lost of MSB pagefault address bits. So make
page_size to return unsigned long to avoid such situation.
Signed-off-by: Vladislav Khmelevsky <och95@yandex.ru>
Power ISA 3.0 added a new syscall instruction. Kernel 5.9 added
corresponding support.
Add CRIU support to recognize the new instruction and kernel ABI changes
to properly dump and restore threads executing in syscalls. Without this
change threads executing in syscalls using the scv instruction will not
be restored to re-execute the syscall, they will be restored to execute
the following instruction and will return unexpected error codes
(ERESTARTSYS, etc) to user code.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The 288d6a61e29d change broke all the syscall numbers.
Reported-by: Michał Mirosław <emmir@google.com>
Fixes: (288d6a61e29d "loongarch64: reformat syscall_64.tbl for 8-wide tabs")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Note: Silently drops MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED as it's
not currently detectable. This is still better than silently dropping
all membarrier() registrations.
Signed-off-by: Michał Mirosław <emmir@google.com>
This commit revises the error handling in the fdspy test. Previously,
a failure case could have been incorrectly reported as successful because
of a specific check `pass != 0`, leading to potential false positives
when `check_pipe_ends()` returned `-1` due to a read/write pipe error.
To improve this, we've adjusted the error handling to return `0` in case
of any error. As such, the final success condition remains unchanged. This
approach will help accurately differentiate between successful and failed
cases, ensuring the output "All OK" is printed for success, and "Something
went WRONG" for any failure.
Fixes: 5364ca3 ("compel/test: Fix warn_unused_result")
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
Apparently Skylake uses init-optimization when saving FPU state, and ptrace()
returns XSTATE_BV[0] = 0 meaning FPU was not used by a task (in init state).
Since CRIU restore uses sigreturn to restore registers, FPU state is always
restored. Fill the state with default values on dump to make restore happy.
Signed-off-by: Michał Mirosław <emmir@google.com>
Newer Intel CPUs (Sapphire Rapids) have a much larger xsave area than
before. Looking at older CPUs I see 2440 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00000988 (2440)
On newer CPUs (Sapphire Rapids) it grows to 11008 bytes.
# cpuid -1 -l 0xd -s 0
...
bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008)
This increase the xsave area from one page to four pages.
Without this patch the fpu03 test fails, with this patch it works again.
Signed-off-by: Adrian Reber <areber@redhat.com>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.
Let's just use %s + strerror(errno) instead of %m with pr_* functions to
be explicit that errno to string transformation happens before calling
anything else.
Note: tcp_repair_off is called from pie with no pr_perror defined due to
CR_NOGLIBC set and if I use errno variable there I get "Unexpected
undefined symbol: `__errno_location'. External symbol in PIE?", so it
seems there is no way to print errno there, so let's just skip it.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The ppc64le ABI allows functions to store data in caller frames.
When initializing the stack pointer prior to executing parasite code
we need to pre-allocating the minimum sized stack frame before
jumping to the parasite code.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Some ABIs allow functions to store data in caller frame, which
means that we have to allocate an initial stack frame before
executing code on the parasite stack.
This test saves the contents of writable memory that follows the stack
after the victim has been infected but before we start using the
parasite stack. It later checks that the saved data matches the
current contents of the two memory areas. This is done while the
victim is halted so we expect a match unless executing parasite code
caused memory corruption. The test doesn't detect cases where we
corrupted memory by writing the same value.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
return zero on chk success
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Starting the daemon is the first time we run code in the victim
using the parasite stack.
It's useful for testing to be able to infect the victim without starting
the daemon so that we can inspect the victim's state, set up stack
guards, and so on before stack-related corruption can happen.
Add compel_infect_no_daemon() to infect the victim but not start the
daemon and compel_start_daemon() to start the daemon after the victim
is infected.
Add compel_get_stack() to get the victim's main and thread parasite
stacks.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
The x86 implement hardware breakpoint to accelerate the tracing syscall
procedure instead of `ptrace(PTRACE_SYSCALL)`. The arm64 has the same
capability according to <<Learn the architecture: Armv8-A self-hosted
debug>>[[1]].
<<Arm Architecture Reference Manual for A-profile architecture>[[2]]
illustrates the usage detailly:
- D2.8 Breakpoint Instruction exceptions
- D2.9 Breakpoint exceptions
- D13.3.2 DBGBCR<n>_EL1, Debug Breakpoint Control Registers, n
Note:
[1]: https://developer.arm.com/documentation/102120/0100
[2]: https://developer.arm.com/documentation/ddi0487/latest
Signed-off-by: fu.lin <fulin10@huawei.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Breakpoints are used to stop as close as possible to a target system call.
First, we don't need it after this point.
Second, PTRACE_CONT can't pass through a breakpoint on arm64.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
When delivering system call traps, set bit 7 in the signal number (i.e.,
deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Error from:
./test/zdtm.py run -t zdtm/static/fpu00 --fault 134 -f h --norst
(00.003111) Dumping GP/FPU registers for 56
(00.003121) Error (compel/arch/x86/src/lib/infect.c:310): Corrupting fpuregs for 56, seed 1651766595
(00.003125) Error (compel/arch/x86/src/lib/infect.c:314): Can't set FPU registers for 56: Invalid argument
(00.003129) Error (compel/src/lib/infect.c:688): Can't obtain regs for thread 56
(00.003174) Error (criu/cr-dump.c:1564): Can't infect (pid: 56) with parasite
See also:
145e9e0d8c6 ("x86/fpu: Fail ptrace() requests that try to set invalid MXCSR values")
145e9e0d8c
We decided to move from mxcsr cleaning up scheme and use mxcsr mask
(0x0000ffbf) as kernel does. Thanks to Dmitry Safonov for pointing out.
Tested-on: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
Reported-by: Mr. Jenkins
Suggested-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image, save it only if a task is in the stopped state.
Restore task state by sending desired stop signal if it is present
in the image. Fallback to SIGSTOP if it's absent.
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Add "get_rseq_conf" feature corresponding to the
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Those that codespell have a few variants for:
./soccr/soccr.c:219: thise ==> these, this
./soccr/soccr.c:444: sence ==> sense, since
./criu/net.c:665: ot ==> to, of, or
./criu/net.c:775: ot ==> to, of, or
./criu/files.c:1244: wan't ==> want, wasn't
./criu/kerndat.c:1141: happend ==> happened, happens, happen
./criu/mount-v2.c:781: carefull ==> careful, carefully
./test/zdtm/static/socket_aio.c:54: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen6.c:73: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen.c:73: Chiled ==> Child, chilled
./test/zdtm/static/socket_listen4v6.c:73: Chiled ==> Child, chilled
./test/zdtm/static/sk-unix-dgram-ghost.c:201: childs ==> children, child's
./test/zdtm/static/sk-unix-dgram-ghost.c:205: childs ==> children, child's
./compel/arch/x86/src/lib/infect.c:297: automatical ==> automatically, automatic, automated
While at it, do some other minor fixes in the same lines.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I am not sure if this is going to bring any compatibility issues.
If yes, we need to remove this patch and add "useable" to the list of
ignored words instead.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
CRIU has a few places where it creates unix sockets and their names have to be
unique for each criu run.
Fixes: #1798
Signed-off-by: Andrei Vagin <avagin@google.com>