Currently each task subtracts number of zombies from
task_entries->nr_threads without locks, so if two tasks will do this
operation concurrently, the result may be unpredictable.
https://github.com/xemul/criu/issues/13
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The old and new address parameters passed to the mremap system
call must be page size aligned. On AArch64, the page size can
only be correctly determined at run time. This fixes the following
errors for CRIU on AArch64 kernels with CONFIG_ARM64_64K_PAGES=y.
call mremap(0x3ffb7d50000, 8192, 8192, MAYMOVE | FIXED, 0x2a000)
Error (rst-malloc.c:201): Can't mremap rst mem: Invalid argument
call mremap(0x3ffb7d90000, 8192, 8192, MAYMOVE | FIXED, 0x32000)
Error (rst-malloc.c:201): Can't mremap rst mem: Invalid argument
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This fixes the following error for CRIU on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y.
Error (cr-restore.c:2828): Can't mmap section for restore code
This occurred because the address being requested (0x16000 in
one case) was not page aligned.
Also change the capitalization of the pie_size() macro to make it
clear that the value is not necessarily a build-time constant.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When a TASK_HELPER would exit just before a zombie, sometimes the signal
would get coalesced, and we would miss the zombie exit, causing us to block
forever waiting for the zombie to complete. Let's use an entirely different
strategy for waiting on zombies: explicitly wait on them with waitid, and
use WNOWAIT to prevent their data from actually being reaped.
v2: don't decrement nr_{tasks,threads} in the loop
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
8ffbe754bd9 moved the rst_mem_lock() call, but didn't move the
corresponding LSM allocations, so we do that here.
One unfortunate thing is that we have to split this into two steps: first
we have to read the creds to figure out exactly how much memory to
allocate for the lsm string. Since prepare_creds() wants to write directly
to the task_restore_args struct and that can't be allocated until after we
lock the restore memory, we break it up into two steps.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
After we got the total remapable rst memory size, we no longer
can allocate from it, otherwise the bootstrap area will not
have enough size.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's similar to previous patch with tcp mem -- no need to
realloc big arrays and then memcpy data between them. It's
enough just to walk timerfd objects at the very end.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In current scheme we grow an array with realloc()-s then
memcpy() the result into rst_mem. I propose to get rid
or realloc-s (we already have objects for the data we
need to keep) and memcpy-s (and put objects directly
into rst_mem at the end).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Calling rst_mem_alloc() in a loop with increasing size causes the
n^2 memory grow :) since _alloc is not _realloc.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
linux/seccomp.h may not be available, and the seccomp mode might not be
listed in /proc/pid/status, so let's not assume those two things are
present.
v2: add a seccomp.h with all the constants we use from linux/seccomp.h
v3: don't do a compile time check for PTRACE_O_SUSPEND_SECCOMP, just let
ptrace return EINVAL for it; also add a checkskip to skip the
seccomp_strict test if PTRACE_O_SUSPEND_SECCOMP or linux/seccomp.h
aren't present.
v4: use criu check --feature instead of checkskip to check whether the
kernel supports seccomp_suspend
Reported-by: Mr. Jenkins
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since we don't support dumping per-thread creds, let's at least fail to
dump if the creds don't match.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Unfortunately, SECCOMP_MODE_FILTER is not currently exposed to userspace,
so we can't checkpoint that. In any case, this is what we need to do for
SECCOMP_MODE_STRICT, so let's do it.
This patch works by first disabling seccomp for any processes who are going
to have seccomp filters restored, then restoring the process (including the
seccomp filters), and finally resuming the seccomp filters before detaching
from the process.
v2 changes:
* update for kernel patch v2
* use protobuf enum for seccomp type
* don't parse /proc/pid/status twice
v3 changes:
* get rid of extra CR_STAGE_SECCOMP_SUSPEND stage
* only suspend seccomp in finalize_restore(), just before the unmap
* restore the (same) seccomp state in threads too; also add a note about
how this is slightly wrong, and that we should at least check for a
mismatch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Eric wants to restrict permissions for proc mounts in a non-root userns
according with proc mounts in the root userns.
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri May 8 23:49:47 2015 -0500
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
Ignore an existing mount if the locked readonly, nodev or atime
attributes are less permissive than the desired attributes
of the new mount.
...
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Instead of keeping around multiple fds that point to various places in
/proc, let's just use /proc and openat() things relative to it.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a little tricky, since the threads are forked in the restorer blob, we
can't open their attr/curent files to pass into the restorer blob. So, we pass
in an fd for /proc that the restorer blob can use to access the attr/current
files once they exist.
N.B. this is still incorrect in that it restores the same credentials for all
threads in the group; however, it matches the behavior of the current creds
restore code, which also restores the same creds for all threads in the group.
v2: use simple_sprintf() instead of pie_strcat()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Otherwise getting
| parasite-syscall.c: In function ‘parasite_infect_seized’:
| parasite-syscall.c:1222:5: error: ‘elf_relocs’ undeclared (first use in this function)
Simply wrap the @elf_relocs_apply with macros.
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
- Move relocs application into a separate file which get
compiled as a regular C file in criu (pie/pie-relocs.[ch])
- Move types used by piegen into pie/piegen/uapi/types.h
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
At moment both parasite and restorer do not have
any relocs because we support x86-64 only, but
this will be changed soon so do a call and apply
relocations.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of @gotpcrel relocations we need additional
space to carry pointers.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we have several arrays of objects that get remapped
into pie area and their number is also passed. Clean and shorten
the remapping code a bit and bing their naming to common format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use SIG_SETMASK instead of SIG_BLOCKMASK here in case the parent had SIGCHLD
blocked. In this case if one of the criu threads has a problem, since the
SIGCHLD is blocked, the restore simply hangs.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch adds support for checkpoint and restore of two linux security
modules (apparmor and selinux). The actual checkpoint or restore code isn't
that interesting, other than that we have to do the LSM restore in the restorer
blob since it may block any number of things that we want to do as part of the
restore process.
I tried originally to get this to work using libraries in the restorer blob,
but I could _not_ get things to work correctly (I assume I was doing something
wrong with all the static linking, you can see my draft attempts here:
https://github.com/tych0/criu/commits/apparmor-using-libraries ). I can try to
resurrect this if it makes more sense, to do it that way, though.
v2: lsm_profile lives in creds.proto instead of the task core, look in a more
canonical place for selinuxfs and don't try to special case any selinux
profile names.
v3: only allow unconfined selinux profiles
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some architectures like ppc64 requires a trampoline to be called prior to
the standard restorer services.
This patch introduces 3 trampolines which can be overwritten by
architectures in arch/x/include/asm/restore.h:
- arch_export_restore_thread
- arch_export_restore_task
- arch_export_unmap
The architecture which doesn't need to overwrite them, has nothing to do.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch initiates the ppc64le architecture support in CRIU.
Note that ppc64 (Big Endian) architecture is not yet supported since there
are still several issues to address with this architecture. However, in the
long term, the two architectures should be addressed using the almost the
same code, so sharing the ppc64 directory.
Major ppc64 issues:
Loader is not involved when the parasite code is loaded. So no relocation
is done for the parasite code. As a consequence r2 must be set manually
when entering the parasite code, and GOT is not filled.
Furthermore, the r2 fixup code at the services's global address which has
not been fixed by the loader should not be run. Branching at local address,
as the assembly code does is jumping over it.
On the long term, relocation should be done when loading the parasite code.
We are introducing 2 trampolines for the 2 entry points of the restorer
blob. These entry points are dealing with r2. These ppc64 specific entry
points are overwritting the standard one in sigreturn_restore() from
cr-restore.c. Instead of using #ifdef, we may introduce a per arch wrapper
here.
CRIU needs 2 kernel patches to be run powerpc which are not yet upstream:
- Tracking the vDSO remapping
- Enabling the kcmp system call on powerpc
Feature not yet supported:
- Altivec registers C/R
- VSX registers C/R
- TM support
- all lot of things I missed..
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of rst_sibling, we are tracing the root task, but we are tracing
only the leader thread, so we must attach to other threads to stop them.
Reported-by: Ross Boucher <rboucher@gmail.com>
Cc: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have two helpers for VMA type testing: privately_dump_vma() and vma_priv(). They
work with different types but basically do the same: check if we should dump VMA into
the image and restore it back then.
Lets unify they both into common vma_entry_is_private() helper and vma_area_is_private()
for working with vma_area type.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
*can* reads that it might fail but won't. Instead, this comment means that any
code added below it should not fail, because the network is unlocked and so the
restored process is exposed to the world.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to finalize the cg yard both on successful cgroup restore and on a
failed restore. Further, we should restore the cgroup properties before
allowing the task to continue in all modes (previously properties were only
restored correctly in --restore-detached mode).
CC: Saied Kazemi <saied@google.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If the --restore-detached command line option is not specified during
restore, CRIU should unmount and remove the temporary cgyard directory
tree before waiting for the restored process to exit. Otherwise, all
the temporary cgyard mount points will remain mounted and visible.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When an image of a certian type is not found, CRIU sometimes
fails, sometimes ignores this fact. I propose to ignore this
fact always and treat absent images and those containing no
objects inside (i.e. -- empty). If the latter code flow will
_need_ objects, then criu will fail later.
Why object will be explicitly required? For example, due to
restoring code reading the image with pb_read_one, w/o the
_eof suffix thus required the object to be in the image.
Another example is objects dependencies. E.g. fdinfo objects
require various files objects. So missing image files will
result in non-resolved searches later.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When page-read fails to open the pagemap image it reports error.
One place (stacked page-reads) need to handle the absent images
case gracefully, so fix the return codes to make this check
work.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Current code doesn't make any difference between OPT and no-OPT
except for the message is printed or not in the open_image().
So this particular change changes nothing but the availability of
this message.
In the next patches I wil introduce "empty images" to deal with
the ENOENT situation in a more graceful manner.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have collected a good set of calls that cannot be done inside
user namespaces, but we need to [1]. Some of them has already
being addressed, like prctl mm bits restore, but some are not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to unpriviledged user), so
we need some way to call those things even in user namespaces.
The good news about it its that all the calls I've found operate
on file descriptors this way or another. So if we had a process,
that lived outside of user namespace, we could ask one to do the
high priority operation we need and exchange the affected file
descriptor via unix socket.
So the usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via unix socket.
Clients (the processes we restore) send him the functions they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
In non usernamespace case the daemon is not started and the calls
are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the priviledged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
When restoring a pair of veth devices that had one end inside a namespace
or container and the other end outside, CRIU creates a new veth pair,
puts one end in the namespace/container, and names the other end from
what's specified in the --veth-pair IN=OUT command line option.
This patch allows for appending a bridge name to the OUT string in the
form of OUT@<BRIDGE-NAME> in order for CRIU to move the outside veth to
the named bridge. For example, --veth-pair eth0=veth1@br0 tells CRIU
to name the peer of eth0 veth1 and move it to bridge br0.
This is a simple and handy extension of the --veth-pair option that
obviates the need for an action script although one can still do the same
(and possibly more) if they prefer to use action scripts.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a nasty issue with it. Current code allocates these
entries in shremap area one by one. We do NOT allocate any
OTHER entries in this region, but if we will this array will
be spoiled.
Fortunately we no longer need shmem-infos as plain array,
neither we need one in restorer. So just turn this into plain
shared objects and collect them in a list.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>