There were multiple copies of the same code spread over the different
architectures handling the vDSO.
This patch merges the duplicated code from arch/*/vdso-pie.c and
arch/*/include/asm/vdso.h into common files and leaves only the
architecture-specific parts in the arch/*/* files.
The files are now organized this way:
include/asm-generic/vdso.h
contains basic definitions which can be overridden by
architectures.
arch/*/include/asm/vdso.h
contains per-architecture definitions.
It may include include/asm-generic/vdso.h.
pie/util-vdso.c
include/util-vdso.h
These files contain code and definitions common to both criu and
the parasite code.
The file include/util-vdso.h includes arch/*/include/asm/vdso.h.
pie/parasite-vdso.c
include/parasite-vdso.h
contain code and definitions specific to the parasite code handling
the vDSO.
The file include/parasite-vdso.h includes include/util-vdso.h.
arch/*/vdso-pie.c
contains the architecture specific code installing the vDSO
trampoline.
vdso.c
include/vdso.h
contain code and definitions specific to the criu code handling the
vDSO.
The file include/vdso.h includes include/util-vdso.h.
CC: Christopher Covington <cov@codeaurora.org>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
To handle deleted bind mounts we simply recreate
the directory the bind mount used to live at, mount
the target there and then remove the directory again.
For this purpose we add a @deleted entry to the image.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Make it handle both postfixes and return
a non-zero code if stripping happened.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It is useless; at least in its present form
it's unreadable anyway. So stop
bloating the logs.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's quite unclean that this structure lives
in proc_parse.h, which only has to fill the
structure on procfs read, while the real handling
is inside mount.c. Move it as appropriate.
At the same time the ext_mount structure should be moved
into a header as well, with the sane name @list
used instead of @l.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For example, we hit a case where systemd carries a journal
file 4M in size.
https://jira.sw.ru/browse/PSBM-38571
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Without using a freezer cgroup, we need to do a few iterations to catch
all tasks, because new tasks can be born. If new tasks appear faster
than criu collects them, criu fails. The freezer cgroup allows us to
solve this problem.
We freeze the freezer cgroup, then attach to the tasks with ptrace and
thaw the freezer cgroup. We assume that all tasks which are going to be
dumped are in the specified freezer cgroup.
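The freeze and thaw steps above can be sketched roughly as follows. This
is only an illustration, assuming the cgroup-v1 freezer interface where
writing FROZEN or THAWED to the group's freezer.state file changes its
state. The helper name set_freezer_state() and its arguments are
hypothetical and not taken from this patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a state ("FROZEN" or "THAWED") into
 * <cgroup_dir>/freezer.state. Returns 0 on success, -1 on error.
 */
static int set_freezer_state(const char *cgroup_dir, const char *state)
{
	char path[4096];
	size_t len;
	ssize_t ret;
	int fd, n;

	assert(cgroup_dir && state);
	len = strlen(state);

	n = snprintf(path, sizeof(path), "%s/freezer.state", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = write(fd, state, len);
	close(fd);

	return ret == (ssize_t)len ? 0 : -1;
}
```

A caller following the order described above would write FROZEN, attach
to every task in the group with ptrace, and only then write THAWED.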
v2: fix comments from Christopher
Reviewed-by: Christopher Covington <cov@codeaurora.org>
v3: refactor task_seize
v4: fix comments from Pavel
Cc: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required for dumping tmpfs, where we use tar to save the content.
We need to execute tar from the proper userns to get the right uids and
gids for files.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is here only to support Linux kernels between versions
3.18 and 4.2. After that, this workaround is not needed anymore,
but it works properly on kernels both with and without the bug.
The bug is that when a process has a file open in an OverlayFS directory,
the information in /proc/<pid>/fd/<fd> and /proc/<pid>/fdinfo/<fd>
is wrong, so we grab that information from the mountinfo table instead.
This is done every time fill_fdlink is called.
We first check to see if the mnt_id and st_dev numbers currently match
some entry in the mountinfo table. If so, we already have the correct mnt_id
and no fixup is needed.
Then we proceed to see if there are any OverlayFS mounted directories
in the mountinfo table. If so, we concatenate the mountpoint with the
name of the file, and stat the resulting path to check if we found the
correct device id and inode number. If that is the case, we update the
mount id and link variables with the correct values.
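The stat-based check in the last step could look roughly like the sketch
below. The helper name overlay_path_matches() and its parameters are
hypothetical; the real code in this patch works on the mountinfo entries
and may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: join an OverlayFS mountpoint with the file name
 * and check whether the resulting path refers to the same device and
 * inode as the open file.
 */
static bool overlay_path_matches(const char *mountpoint, const char *name,
				 dev_t st_dev, ino_t st_ino)
{
	char path[4096];
	struct stat st;
	int n;

	assert(mountpoint && name);

	n = snprintf(path, sizeof(path), "%s/%s", mountpoint, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return false;

	if (stat(path, &st) < 0)
		return false;

	return st.st_dev == st_dev && st.st_ino == st_ino;
}
```

Only when this returns true would the mount id and link be replaced
with the values from that mountinfo entry.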
Signed-off-by: Gabriel Guimaraes <gabriellimaguimaraes@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is preparation for using a freezer cgroup to freeze tasks.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded. Since
vma_area_is_private() is used by both restorer blob code and
non-restorer blob code, which must use different variables for recording
the task size, make task_size a function argument and modify the call
sites accordingly. This fixes the following error on AArch64 kernels
with CONFIG_ARM64_64K_PAGES=y.
pie: Error (pie/restorer.c:929): Can't restore 0x3ffb7e70000 mapping with 0xfffffffffffffff7
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded.
This fixes the following error on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y.
pie: Error (pie/restorer.c:772): Unable to unmap (-): -1211695104
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently each task subtracts the number of zombies from
task_entries->nr_threads without locks, so if two tasks do this
operation concurrently, the result may be unpredictable.
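To illustrate the race: a plain "nr_threads -= n" is a separate load and
store, so two concurrent updates can overwrite each other. One way to
make the update indivisible is sketched below with C11 atomics. This is
only an assumption about a possible fix; the patch itself may take a
lock or restructure the accounting instead. The struct and function
names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the shared counters; not CRIU's real struct. */
struct task_counters {
	atomic_int nr_threads;
};

/*
 * Subtract nr_zombies in one indivisible step and return the counter
 * value after the subtraction.
 */
static int drop_zombies(struct task_counters *c, int nr_zombies)
{
	assert(c && nr_zombies >= 0);
	return atomic_fetch_sub(&c->nr_threads, nr_zombies) - nr_zombies;
}
```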
https://github.com/xemul/criu/issues/13
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
* Added functionality for dumping unnamed unix sockets.
When we call CRIU with the dump option, for an unnamed socket we
should pass its inode to --ext-unix-sk. Details about this problem
are described at http://criu.org/External_UNIX_socket#What_to_do_with_socketpair.28.29-s.3F.
Usage example:
criu dump -D images -o dump.log -v4 --ext-unix-sk=4529709 -t 13506
* fix typo error in log output
Signed-off-by: Artem Kuzmitskiy <artem.kuzmitskiy@lge.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We use native cpuid, so this one is no longer used.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When a TASK_HELPER would exit just before a zombie, sometimes the signal
would get coalesced, and we would miss the zombie exit, causing us to block
forever waiting for the zombie to complete. Let's use an entirely different
strategy for waiting on zombies: explicitly wait on them with waitid, and
use WNOWAIT to prevent their data from actually being reaped.
v2: don't decrement nr_{tasks,threads} in the loop
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this in the next patch for collecting the zombies without
actually waiting on them.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In criu 1.6, if no --manage-cgroups option was specified
we still restored default (known) properties. But in commit
c7d646afb373 we enhanced its semantics and accidentally broke
backward compatibility: if no --manage-cgroups is passed at all,
it's assumed that one asks to not touch cgroups at all on
restore. To restore the old behaviour, set up "soft" mode by
default.
Reported-by: Andrew Vagin <avagin@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Once we have got the total remappable rst memory size, we can no
longer allocate from it, otherwise the bootstrap area will not
be big enough.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's similar to the previous patch with tcp mem -- no need to
realloc big arrays and then memcpy data between them. It's
enough just to walk the timerfd objects at the very end.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In the current scheme we grow an array with realloc()-s and then
memcpy() the result into rst_mem. I propose to get rid
of the realloc-s (we already have objects for the data we
need to keep) and the memcpy-s (and put the objects directly
into rst_mem at the end).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
linux/seccomp.h may not be available, and the seccomp mode might not be
listed in /proc/pid/status, so let's not assume those two things are
present.
v2: add a seccomp.h with all the constants we use from linux/seccomp.h
v3: don't do a compile time check for PTRACE_O_SUSPEND_SECCOMP, just let
ptrace return EINVAL for it; also add a checkskip to skip the
seccomp_strict test if PTRACE_O_SUSPEND_SECCOMP or linux/seccomp.h
aren't present.
v4: use criu check --feature instead of checkskip to check whether the
kernel supports seccomp_suspend
Reported-by: Mr. Jenkins
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since we don't support dumping per-thread creds, let's at least fail to
dump if the creds don't match.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Unfortunately, SECCOMP_MODE_FILTER is not currently exposed to userspace,
so we can't checkpoint that. In any case, this is what we need to do for
SECCOMP_MODE_STRICT, so let's do it.
This patch works by first disabling seccomp for any processes that are going
to have seccomp filters restored, then restoring the process (including the
seccomp filters), and finally resuming the seccomp filters before detaching
from the process.
v2 changes:
* update for kernel patch v2
* use protobuf enum for seccomp type
* don't parse /proc/pid/status twice
v3 changes:
* get rid of extra CR_STAGE_SECCOMP_SUSPEND stage
* only suspend seccomp in finalize_restore(), just before the unmap
* restore the (same) seccomp state in threads too; also add a note about
how this is slightly wrong, and that we should at least check for a
mismatch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reasoning: some systems have /sys/fs/cgroup stuff mounted read-only
and we have to either remount it rw or create our own set. The former
doesn't look sane as this rw remounting is also done by systemd, so
let's go back to manual cgyard construction.
This reverts commit 860df95f859cf7ba23b57fc832793c623a5897e4.
Conflicts:
cgroup.c
include/cr_options.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Instead of keeping around multiple fds that point to various places in
/proc, let's just use /proc and openat() things relative to it.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a little tricky: since the threads are forked in the restorer blob, we
can't open their attr/current files to pass into the restorer blob. So, we pass
in an fd for /proc that the restorer blob can use to access the attr/current
files once they exist.
N.B. this is still incorrect in that it restores the same credentials for all
threads in the group; however, it matches the behavior of the current creds
restore code, which also restores the same creds for all threads in the group.
v2: use simple_sprintf() instead of pie_strcat()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this in the next patch for printing paths to LSM files in /proc.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While playing with checkpoint/restore of containers I found
that we can't reuse existing controllers if they were pre-created.
For example, currently in PCS7 we bind-mount the cgroups which belong
to a container in the form of
/sys/fs/cgroup/<controller>/<container> ==> /sys/fs/cgroup/<controller>
CRIU dumps such a configuration fine, but on restore
it recreates the controllers from scratch, whereas we would
like to bind-mount them and ask CRIU to restore only the subcgroups
and their parameters.
So I extended the --manage-cgroups option to take a <mode> argument.
See the docs for a detailed description.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we always create a temporary directory where we restore
cgroups, but this won't work if mounting cgroups is forbidden
from inside a container for some reason (as in the OpenVZ kernel).
So one can pass the --cgroup-yard option to specify an existing
directory where the cgroups live. By default we assume it
lies in /sys/fs/cgroup.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On PPC64, the hard-coded definition of TFD_IOC_SET_TICKS doesn't match the
kernel one.
We should use the _IOW-based one to be more flexible here.
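For illustration, an _IOW-based definition could look like the sketch
below. The ('T', 0, __u64) triple mirrors the kernel's uapi timerfd
header as far as I know; the check function is hypothetical and only
there to show that the request decodes back to the same parts on any
architecture.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/*
 * Build the request number from its parts rather than hard-coding a
 * value computed for one architecture. The bit layout used for the
 * direction and size fields is architecture-specific, so a literal
 * that is right on x86 can be wrong elsewhere.
 */
#ifndef TFD_IOC_SET_TICKS
#define TFD_IOC_SET_TICKS _IOW('T', 0, __u64)
#endif

/* Hypothetical self-check: decode the request with the arch's own macros. */
static void check_tfd_set_ticks(void)
{
	assert(_IOC_DIR(TFD_IOC_SET_TICKS) == _IOC_WRITE);
	assert(_IOC_TYPE(TFD_IOC_SET_TICKS) == 'T');
	assert(_IOC_NR(TFD_IOC_SET_TICKS) == 0);
	assert(_IOC_SIZE(TFD_IOC_SET_TICKS) == sizeof(__u64));
}
```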
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we have several arrays of objects that get remapped
into the pie area, and their counts are also passed. Clean up and shorten
the remapping code a bit and bring their naming to a common format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Otherwise we eventually get compiler warnings, which end up
aborting the build.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We check files in /sys, so we must do this from the host mount namespace.
The write_img_inventory() is called after kerndat_init() and it's only
called on dump. The bug is triggered on restore, because the mount
namespace of the restored process doesn't have
/sys/kernel/security/apparmor/
I think it's better to initialize the host lsm in one place for dump
and restore.
Currently we initialize the host lsm when we try to use it for the first
time. It works fine for the dump operation. On restore it doesn't work
because criu checks files in a restored mount namespace and it does this
for each process, which isn't optimal.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some entries might be missing and that should not cause
CRIU to stop dumping when we know the entries are safe
to ignore.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch adds support for checkpoint and restore of two linux security
modules (apparmor and selinux). The actual checkpoint or restore code isn't
that interesting, other than that we have to do the LSM restore in the restorer
blob since it may block any number of things that we want to do as part of the
restore process.
I tried originally to get this to work using libraries in the restorer blob,
but I could _not_ get things to work correctly (I assume I was doing something
wrong with all the static linking, you can see my draft attempts here:
https://github.com/tych0/criu/commits/apparmor-using-libraries ). I can try to
resurrect this if it makes more sense to do it that way, though.
v2: lsm_profile lives in creds.proto instead of the task core, look in a more
canonical place for selinuxfs and don't try to special case any selinux
profile names.
v3: only allow unconfined selinux profiles
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some architectures like ppc64 require a trampoline to be called prior to
the standard restorer services.
This patch introduces 3 trampolines which can be overridden by
architectures in arch/x/include/asm/restore.h:
- arch_export_restore_thread
- arch_export_restore_task
- arch_export_unmap
An architecture which doesn't need to override them has nothing to do.
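The usual way to get this kind of "override if you need to" behaviour is
a default definition guarded by #ifndef, sketched below. The three
arch_export_* names come from the list above; the generic_* functions
and check_defaults() are hypothetical stand-ins, and the real defaults
in the tree may be defined differently.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the generic restorer entry points (hypothetical names). */
static long generic_restore_thread(void *args) { (void)args; return 1; }
static long generic_restore_task(void *args)   { (void)args; return 2; }
static long generic_unmap(void *args)          { (void)args; return 3; }

/*
 * An architecture needing a trampoline would define these macros in
 * its asm/restore.h, included before this point. Architectures that
 * define nothing fall through to the generic entry points.
 */
#ifndef arch_export_restore_thread
#define arch_export_restore_thread generic_restore_thread
#endif

#ifndef arch_export_restore_task
#define arch_export_restore_task generic_restore_task
#endif

#ifndef arch_export_unmap
#define arch_export_unmap generic_unmap
#endif

/* Callers always use the arch_export_* names, never the generic ones. */
static void check_defaults(void)
{
	assert(arch_export_restore_thread(NULL) == 1);
	assert(arch_export_restore_task(NULL) == 2);
	assert(arch_export_unmap(NULL) == 3);
}
```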
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>