In case of @gotpcrel relocations we need additional
space to carry pointers.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we have several arrays of objects that get remapped
into pie area and their number is also passed. Clean and shorten
the remapping code a bit and bing their naming to common format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use SIG_SETMASK instead of SIG_BLOCKMASK here in case the parent had SIGCHLD
blocked. In this case if one of the criu threads has a problem, since the
SIGCHLD is blocked, the restore simply hangs.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch adds support for checkpoint and restore of two linux security
modules (apparmor and selinux). The actual checkpoint or restore code isn't
that interesting, other than that we have to do the LSM restore in the restorer
blob since it may block any number of things that we want to do as part of the
restore process.
I tried originally to get this to work using libraries in the restorer blob,
but I could _not_ get things to work correctly (I assume I was doing something
wrong with all the static linking, you can see my draft attempts here:
https://github.com/tych0/criu/commits/apparmor-using-libraries ). I can try to
resurrect this if it makes more sense, to do it that way, though.
v2: lsm_profile lives in creds.proto instead of the task core, look in a more
canonical place for selinuxfs and don't try to special case any selinux
profile names.
v3: only allow unconfined selinux profiles
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some architectures like ppc64 requires a trampoline to be called prior to
the standard restorer services.
This patch introduces 3 trampolines which can be overwritten by
architectures in arch/x/include/asm/restore.h:
- arch_export_restore_thread
- arch_export_restore_task
- arch_export_unmap
The architecture which doesn't need to overwrite them, has nothing to do.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch initiates the ppc64le architecture support in CRIU.
Note that ppc64 (Big Endian) architecture is not yet supported since there
are still several issues to address with this architecture. However, in the
long term, the two architectures should be addressed using the almost the
same code, so sharing the ppc64 directory.
Major ppc64 issues:
Loader is not involved when the parasite code is loaded. So no relocation
is done for the parasite code. As a consequence r2 must be set manually
when entering the parasite code, and GOT is not filled.
Furthermore, the r2 fixup code at the services's global address which has
not been fixed by the loader should not be run. Branching at local address,
as the assembly code does is jumping over it.
On the long term, relocation should be done when loading the parasite code.
We are introducing 2 trampolines for the 2 entry points of the restorer
blob. These entry points are dealing with r2. These ppc64 specific entry
points are overwritting the standard one in sigreturn_restore() from
cr-restore.c. Instead of using #ifdef, we may introduce a per arch wrapper
here.
CRIU needs 2 kernel patches to be run powerpc which are not yet upstream:
- Tracking the vDSO remapping
- Enabling the kcmp system call on powerpc
Feature not yet supported:
- Altivec registers C/R
- VSX registers C/R
- TM support
- all lot of things I missed..
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of rst_sibling, we are tracing the root task, but we are tracing
only the leader thread, so we must attach to other threads to stop them.
Reported-by: Ross Boucher <rboucher@gmail.com>
Cc: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have two helpers for VMA type testing: privately_dump_vma() and vma_priv(). They
work with different types but basically do the same: check if we should dump VMA into
the image and restore it back then.
Lets unify they both into common vma_entry_is_private() helper and vma_area_is_private()
for working with vma_area type.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
*can* reads that it might fail but won't. Instead, this comment means that any
code added below it should not fail, because the network is unlocked and so the
restored process is exposed to the world.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to finalize the cg yard both on successful cgroup restore and on a
failed restore. Further, we should restore the cgroup properties before
allowing the task to continue in all modes (previously properties were only
restored correctly in --restore-detached mode).
CC: Saied Kazemi <saied@google.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If the --restore-detached command line option is not specified during
restore, CRIU should unmount and remove the temporary cgyard directory
tree before waiting for the restored process to exit. Otherwise, all
the temporary cgyard mount points will remain mounted and visible.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When an image of a certian type is not found, CRIU sometimes
fails, sometimes ignores this fact. I propose to ignore this
fact always and treat absent images and those containing no
objects inside (i.e. -- empty). If the latter code flow will
_need_ objects, then criu will fail later.
Why object will be explicitly required? For example, due to
restoring code reading the image with pb_read_one, w/o the
_eof suffix thus required the object to be in the image.
Another example is objects dependencies. E.g. fdinfo objects
require various files objects. So missing image files will
result in non-resolved searches later.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When page-read fails to open the pagemap image it reports error.
One place (stacked page-reads) need to handle the absent images
case gracefully, so fix the return codes to make this check
work.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Current code doesn't make any difference between OPT and no-OPT
except for the message is printed or not in the open_image().
So this particular change changes nothing but the availability of
this message.
In the next patches I wil introduce "empty images" to deal with
the ENOENT situation in a more graceful manner.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have collected a good set of calls that cannot be done inside
user namespaces, but we need to [1]. Some of them has already
being addressed, like prctl mm bits restore, but some are not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to unpriviledged user), so
we need some way to call those things even in user namespaces.
The good news about it its that all the calls I've found operate
on file descriptors this way or another. So if we had a process,
that lived outside of user namespace, we could ask one to do the
high priority operation we need and exchange the affected file
descriptor via unix socket.
So the usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via unix socket.
Clients (the processes we restore) send him the functions they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
In non usernamespace case the daemon is not started and the calls
are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the priviledged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
When restoring a pair of veth devices that had one end inside a namespace
or container and the other end outside, CRIU creates a new veth pair,
puts one end in the namespace/container, and names the other end from
what's specified in the --veth-pair IN=OUT command line option.
This patch allows for appending a bridge name to the OUT string in the
form of OUT@<BRIDGE-NAME> in order for CRIU to move the outside veth to
the named bridge. For example, --veth-pair eth0=veth1@br0 tells CRIU
to name the peer of eth0 veth1 and move it to bridge br0.
This is a simple and handy extension of the --veth-pair option that
obviates the need for an action script although one can still do the same
(and possibly more) if they prefer to use action scripts.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a nasty issue with it. Current code allocates these
entries in shremap area one by one. We do NOT allocate any
OTHER entries in this region, but if we will this array will
be spoiled.
Fortunately we no longer need shmem-infos as plain array,
neither we need one in restorer. So just turn this into plain
shared objects and collect them in a list.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this mode we test if target cpu has all features present
in image file but do not require bit to bit match: target cpu
may be a new one with more features present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Restoring AIO is quite simple. Once all VMAs are put in
their places we can call io_setup() to let kernel create
the context back and then move the ring into proper place.
Another thing we should "restore" is the context ID. But
the thing is, upon ring creation kernel repots the ring
start address as this ID. And there's a patch in the -next
tree that changes the ID when we remap the ring. That
said after AIO context creation and ring remap we need
to check that the new ID is seen by the kernel.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a very common error when using criu.
The problem here is that we need to somehow transfer cr_errno
from one process to another. I suggest using pipe to give
one end to children and read cr_errno on other after restore
is finished.
v2, Pavel suggested putting errno into shared task_entries.
v3. and he also suggested using cmpxchg
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are cases where a process's file descriptor cannot be restored
from the checkpoint images. For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.
There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time. For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.
In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option. The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.
As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d. This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.
It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited. So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.
For examples please visit http://criu.org/Category:HOWTO.
v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
Also, as an extra measure of caution, added checks in the inherit fd
look up functions to make sure that the inherit fd hasn't been reused.
The patch also includes minor cosmetic changes.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this patch we fill /proc/PID/uid_map and /proc/PID/gid_map for the
root task.
v2: initialize groups in a new namespace.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
v3: add a helper to initialize creds in a new userns
v4: initialize userns creds in prepare_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some kernel modules such as pktgen runs kthred upon
new-net creation taking last_pid we were requested.
Lets workaround this problem using clone + unshare
bundle.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
TASK_HELPERs are created with CLONE_FILES, so if we always close the cg yard
here, it will close it for the other helpers and cause problems. Instead, we
close it much later, in code only called by alive tasks, to ensure that there
is no conflict.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On Wed, Oct 01, 2014 at 04:57:40PM +0400, Pavel Emelyanov wrote:
> On 10/01/2014 01:07 AM, Cyrill Gorcunov wrote:
> > On Tue, Sep 30, 2014 at 09:18:53PM +0400, Cyrill Gorcunov wrote:
> >> If a user requested criu to dump cpuinfo image then we
> >> write one on dump and verify on restore. At the moment
> >> we require all cpu feature bits to match the destination
> >> cpu in a sake of simplicity, but in future we need deps
> >> engine which would filer out bits and test if cpu we're
> >> restoring on is more capable than one we were dumping at
> >> allowing to proceed restore procedure.
> >>
> >> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> >
> > Updated to new img format
Something like attached?
>From 59272a9514311e6736cddee08d5f88aa95d49189 Mon Sep 17 00:00:00 2001
From: Cyrill Gorcunov <gorcunov@openvz.org>
Date: Thu, 25 Sep 2014 16:04:10 +0400
Subject: [PATCH] cpuinfo: x86 -- Add dump and validation of cpuinfo image
If a user requested criu to dump cpuinfo image then we
write one on dump and verify on restore. At the moment
we require all cpu feature bits to match the destination
cpu in a sake of simplicity, but in future we need deps
engine which would filer out bits and test if cpu we're
restoring on is more capable than one we were dumping at
allowing to proceed restore procedure.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We want to have buffered images to speed up dump and,
slightly, restore. Right now we use plan file descriptors
to write and read images to/from. Making them buffered
cannot be gracefully done on plain fds, so introduce
a new class.
This will also help if (when?) we will want to do more
complex changes with images, e.g. store them all in one
file or send them directly to the network.
For now the cr_img just contains one int _fd variable.
This patch chages the prototype of open_image() to
return struct cr_img *, pb_(read|write)* to accept one
and fixes the compilation of the rest of the code :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
The same -- int-fd will soon go away, so return the
explicit int -1 instead of it.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
There will be no int-fd soon, so one more preparation
to this fact.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
The write_img_buf will be used only for images writing, while
in this place we just have a raw file descriptor.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Unfortunately the kernel doesn't flush hw breakpoints on
detaching ptrace. If a breakpoint is triggered without ptrace, it
will be killed by SIGTRAP.
Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore parasite_stop_on_syscall() can be called after PTRACE_SYSCALL
and after a breakpoint. parasite_stop_on_syscall() must be called only
after PTRACE_SYSCALL, so all tests where is one process stuck.
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently CRIU traces syscalls to catch a moment, when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().
We can reduce a number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If a file like /proc/20/mountinfo is open, but 20 is a zombie (or doesn't exist
any more), we can't read this file at all, so a link remap won't work. Instead,
we add a new remap, called the dead process remap, which forks a TASK_HELPER as
that dead pid so that the restore task can open the new /proc/20/mountinfo
instead.
This commit also adds a new stage CR_STATE_RESTORE_SHARED. Since new
TASK_HELPERS are added when loading the shared resource images, we need to wait
to start forking tasks until after these resources are loaded.
v2: fix a mutex bug
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In order to use TASK_HELPERS to open files from dead processes, they should
persist until criu is done restoring the filesystem, which happens in the
RESTORE stage. To do this, we need to pass each helper's PIDs to the restorer
blob, so that it can wait() on them when the restore stage is done.
This commit is in preparation for the remap_dead_pid commits.
v2: wait() on helpers after restore stage is over
v3: add CR_STATE_RESTORE_FS stage
v4: CR_STATE_RESTORE_FS waits for nr_tasks + nr_helpers, not nr_threads
v5: ditch CR_STATE_RESTORE_FS in favor of passing helpers to restorer blob
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When clone-ing kids we can set their stack on current, as
it will anyway be COW-ed later. One thing to note -- we do
need to reserve some space on the stack for glibc's arguments
and retcode allocation. 128 bytes should be enough for 16
pointers while clone has 5 arguments.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In cr_dump_tasks() we expect restore_root_task to return < 0 if
error ocures.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>