criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.
libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).
Signed-off-by: Adrian Reber <areber@redhat.com>
1. os auto assignment vma addr maybe conflict with vma in gpu living migrate scene;
2. so, we should give choice to user;
Signed-off-by: haozi007 <liuhao27@huawei.com>
Commit fc683cb01 ("compel: shstk: save CET state when CPU supports it")
started using PTRACE_ARCH_PRCTL to query shadow stack status. While
PTRACE_ARCH_PRCTL has existed in the kernel for a long time, it was only
added to glibc in version 2.27. Amazon Linux 2 (AL2) has glibc 2.26,
which does not have this definition. As a result, build on AL2 fails
with the below error:
compel/arch/x86/src/lib/infect.c: In function ‘get_task_xsave’:
compel/arch/x86/src/lib/infect.c:276:14: error: ‘PTRACE_ARCH_PRCTL’ undeclared (first use in this function)
276 | if (ptrace(PTRACE_ARCH_PRCTL, pid, (unsigned long)&features, ARCH_SHSTK_STATUS)) {
| ^~~~~~~~~~~~~~~~~
While the definition is present on the system via the kernel headers (in
asm/ptrace-abi.h) which can be reached by including linux/ptrace.h, the
comment in compel/include/uapi/ptrace.h says:
We'd want to include both sys/ptrace.h and linux/ptrace.h, hoping
that most definitions come from either one or another. Alas, on
Alpine/musl both files declare struct ptrace_peeksiginfo_args, so
there is no way they can be used together. Let's rely on libc one.
Since including linux/ptrace.h is not an option, define
PTRACE_ARCH_PRCTL if it doesn't already exist. An interesting point to
note is that in sys/ptrace.h, PTRACE_ARCH_PRCTL is an enum value so the
preprocessor doesn't know about it. PT_ARCH_PRCTL is the preprocessor
symbol that matches the value of PTRACE_ARCH_PRCTL. So look for
PT_ARCH_PRCTL to decide if PTRACE_ARCH_PRCTL is available or not.
Another interesting point to note is that AL2 ships with GCC 7 by
default, which does not support the -mshstk option, causing other build
failures. Luckily, it also ships GCC 10 which does have the option.
Using GCC 10 lets the build succeed.
Fixes: fc683cb01 ("compel: shstk: save CET state when CPU supports it")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
When calling sigreturn with CET enabled, the kernel verifies that the
shadow stack has proper address of sa_restorer and a "restore token".
Normally, they pushed to the shadow stack when signal processing is
started.
Since compel calls sigreturn directly, the shadow stack should be
updated to match the kernel expectations for sigreturn invocation.
Add parasite_setup_shstk() that sets up the shadow stack with the
address of __export_parasite_head_start as sa_restorer and with the
required restore token.
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
As our pr_* functions are complex and can call different system calls
inside before actual printing (e.g. gettimeofday for timestamps) actual
errno at the time of printing may be changed.
Let's just use %s + strerror(errno) instead of %m with pr_* functions to
be explicit that errno to string transformation happens before calling
anything else.
Note: tcp_repair_off is called from pie with no pr_perror defined due to
CR_NOGLIBC set and if I use errno variable there I get "Unexpected
undefined symbol: `__errno_location'. External symbol in PIE?", so it
seems there is no way to print errno there, so let's just skip it.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Starting the daemon is the first time we run code in the victim
using the parasite stack.
It's useful for testing to be able to infect the victim without starting
the daemon so that we can inspect the victim's state, set up stack
guards, and so on before stack-related corruption can happen.
Add compel_infect_no_daemon() to infect the victim but not start the
daemon and compel_start_daemon() to start the daemon after the victim
is infected.
Add compel_get_stack() to get the victim's main and thread parasite
stacks.
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
When delivering system call traps, set bit 7 in the signal number (i.e.,
deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish
normal traps from those caused by a system call.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image, save it only if a task is in the stopped state.
Restore task state by sending desired stop signal if it is present
in the image. Fallback to SIGSTOP if it's absent.
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Add "get_rseq_conf" feature corresponding to the
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
CRIU has a few places where it creates unix sockets and their names have to be
unique for each criu run.
Fixes: #1798
Signed-off-by: Andrei Vagin <avagin@google.com>
This is the workaround for #1429.
The parasite code contains instructions that trigger SIGTRAP to stop at
certain points. In such cases, the kernel sends a force SIGTRAP that
can't be ignore and if it is blocked, the kernel resets its signal
handler to a default one and unblocks it. It means that if we want to
save the origin signal handle
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Arch-dependend way to restore extended registers set.
Use it straight-away to restore per-thread registers.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Extended registers set for task is restored with rt_sigreturn() through
prepared sigframe. For threads it's currently lost.
Preserve it inside thread context to restore on thread curing.
Signed-off-by: Dmitry Safonov <dima@arista.com>
With pseudo-random garbage, the seed is printed with pr_err().
get_task_regs() is called during seizing the task and also for each
thread.
At this moment only for x86.
Signed-off-by: Dmitry Safonov <dima@arista.com>
To minimize things done in parasite, PTRACE_GET_THREAD_AREA can be
used to get remote tls. That also removes an additional compat stack
(de)allocation in the parasite (also asm-coded syscall).
In order to use PTRACE_GET_THREAD_AREA, the dumpee should be stopped.
So, let's move this from criu to compel to non-seized state and put tls
into thread info on x86.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Some kernels have W^X mitigation, which means they won't execute memory
blocks if that memory block is also writable or ever was writable. This
patch enables CRIU to run on such kernels.
1. Align .data section to a page.
2. mmap a memory block for parasite as RX.
3. mprotect everything after .text as RW.
Signed-off-by: Michał Cłapiński <mclapinski@google.com>
Previously, the GOT table was using the same memory location as the
args region, leading to difficult to debug memory corruption bugs.
We allocate the GOT table between the parasite blob and the args region.
The reason this is a good placement is:
1) Putting it after the args region is possible but a bit combersome as
the args region has a variable size
2) The cr-restore.c code maps the parasite code without the args region,
as it does not do RPC.
Another option is to rely on the linker to generate a GOT section, but I
failed to do so despite my best attempts.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Previously, __export_parasite_cmd was located in parasite-head.S, and
__export_parasite_args located exactly at the end of the parasite blob.
This is not ideal for various reasons:
1) These two variables work together. It would be preferrable to have
them in the same location
2) This prevent us from allocating another section betweeen the parasite
blob and the args area. We'll need this to allocate a GOT table
This commit changes the allocation of these symbols from assembly/linker
script to a C file.
Moreover, the assembly entry points that invoke parasite_service()
prepares arguments with hand crafted assembly. This is unecessary.
This commit rewrite this logic with regular C code.
Note: if it wasn't for the x86 compat mode, we could remove all
parasite-head.S files and directly jump to parasite_service() via
ptrace. An int3 architecture specific equivalent could be called at the
end of parasite_service() with an inline asm statement.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
It removes the potential confusion when it comes to virtual address vs
offsets. Further, doing so makes naming more consistent with the rest of
the parasite_blob_desc struct.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
compel_relocs_apply() was taking arguments mostly from the struct
parasite_blob_desc. Instead of passing all the arguments, we pass a
pointer to the struct itself.
This makes the code safer, as cr-restore.c calls compel_relocs_apply().
It previously needed to poke into what can be considered private
variables of the restorer-pie.h file.
To allow the parasite_blob_desc struct to be populated without a
parasite_ctl struct, we expand the compel API to export a
parasite_setup_c_header_desc() in the generated pie.h.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
The file only includes other headers (which may be not needed).
If we aim for one-include-for-compel, we could instead paste all
subheaders into "compel.h".
Rather, I think it's worth to migrate to more fine-grained compel
headers than follow the strategy 'one header to rule them all'.
Further, the header creates problems for cross-compilation: it's
included in files, those are used by host-compel. Which rightfully
confuses compiler/linker as host's definitions for fpu regs/other
platform details get drained into host's compel.
Signed-off-by: Dmitry Safonov <dima@arista.com>
All those compel functions can fail by various reasons.
It may be status of the system, interruption by user or anything else.
It's really desired to handle as many PIE related errors as possible
otherwise it's hard to analyze statuses of parasite/restorer
and the C/R process.
At least warning for logs should be produced or even C/R stopped.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
There are a few places where spaces have been used instead of tabs for
indentation. This patch converts the spaces to tabs for consistency
with the rest of the code base.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
with Android P's Clang versoin: 6.0.2, and Android NDK's Clang version 8.0.2
Clang will report below error:
criu/compel/include/uapi/compel/sigframe-common.h:55:34: error: expected member name or ';' after declaration specifiers
int __unused[32 - (sizeof (k_rtsigset_t) / sizeof (int))];
~~~ ^
it takes __unused as an attribute, not a varible, chang to _unused, pass compile.
Cc: Chen Hu <hu1.chen@intel.com>
Signed-off-by: Zhang Ning <ning.a.zhang@intel.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
We need to know what are stack pointers of every thread to ensure that the
current stack page will not be treated as lazy.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Alice Frosi <alice@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Will need them to mask some of the features from
command line options.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On fedora rawhide seccomp_metadata for some
reason is not defined (while in kernel it introduced
together with PTRACE_SECCOMP_GET_METADATA). So
lets do a trick for a while -- define own alias.
Once system headers get settled down we might find
more suitable solution. Because it's a part of kernel
API we're on the safe side.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will use it to figure out if filter log target is used.
Metadata associated with seccomp filter is relatively new
feature which allows userspace to get and set it back.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
For architectures like aarch64/ppc64 it's needed to propagate the size
of page inside PIEs. For the parasite page size will be defined during
seizing, and for restorer during early initialization.
Afterward we can use PAGE_SIZE in PIEs like we did before.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Ugh, I've spent 25 mins at 4 A.M. to figure out why the tests are failing.
And the reason is stupied me, who defined a new flag after 0x8
as 0x16, not as 0x10. Simplify those definitions for such simple-minded
living creatures like Dima.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On Skylake processors and kernel older than v4.14
ptrace(PTRACE_GETREGSET, pid, NT_X86_XSTATE, iov)
may return not full xstate, ommiting FP part (that is XFEATURE_MASK_FP).
There is a patch which describes this bug:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1318800.html
Anyway, it's fixed in v4.14 kernel by (what we believe with Andrey) this:
https://patchwork.kernel.org/patch/9567939/
As we still support kernels from v3.10 and newer, we need to have a
workaround for this kernel bug on Skylake CPUs.
Big thanks to Shlomi for the reports, the effort and for providing an
Amazon VM to test this. I wish more bug reporters were like you.
Reported-by: Shlomi Matichin <shlomi@binaris.com>
Provided-test-env: Shlomi Matichin <shlomi@binaris.com>
Investigated-with: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
get_task_regs() needs to know if it needs to use workaround
for a Skylake ptrace() bug. The next patch will introduce a
new flag for that.
I also thought about making 3 versions of get_task_regs() and
adding them to ictx->get_task_regs() depending on the flags..
But get_task_regs() is a private function and infect_ctx is
a uapi.. So, let's just pass context flags to get_task_regs().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
As we anyway define save_regs_t for other purposes,
use it in the function declaration.
To unify infect_ctx style, add make_sigframe_t.
Mere cleanup.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Some get_status() methods may allocate data, because
not all of the fields in /proc/[pid]/status file
have the fixed size. For example, NSpid, which
size may vary.
Introduce new method free_status() in counterweight
for such type get_status() methods. it will be called
in case of we go to try_again and need to free allocated
data.
Also, introduce data parameter for a use in the future.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The goal of this function is to compare everything except caps,
but caps size is took to compare. It's wrong, there must be
used offsetof(struct proc_status_creds, cap_inh) instead.
Also, sigpnd may be different too.
v3: Move excluding sigpnd from comparation in this patch (was in another patch).
Reorder fields in seize_task_status().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This naming is left from the first compatible kernel patches.
At that time to return to 32-bit task rt_sigreturn was used with
a special flag.
Now it's not true anymore, the naming doesn't relate.
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>