On arm
| CC crtools.o
| In file included from arch/arm/include/asm/bitops.h:4:0,
| from arch/arm/include/asm/types.h:9,
| from include/proc_parse.h:5,
| from include/ptrace.h:8,
| from cr-restore.c:27:
| cr-restore.c: In function 'restore_priv_vma_content':
| include/compiler.h:60:17: error: comparison of distinct pointer types lacks a cast [-Werror]
| (void) (&_min1 == &_min2); \
|
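A minimal sketch of how a type-checking min() macro of this kind rejects mixed operand types, and one way to satisfy it. The clamp_len() helper and its arguments are hypothetical and not taken from cr-restore.c; the macro relies on GNU C statement expressions and __typeof__:

```c
#include <assert.h>

/*
 * Type-checking min(): comparing the addresses of the two temporaries
 * makes the compiler complain when x and y have different types.
 */
#define min(x, y) ({				\
	__typeof__(x) _min1 = (x);		\
	__typeof__(y) _min2 = (y);		\
	(void) (&_min1 == &_min2);		\
	_min1 < _min2 ? _min1 : _min2; })

/* Hypothetical caller: the cast gives both operands the same type. */
static unsigned long clamp_len(unsigned long left, unsigned int chunk)
{
	assert(chunk > 0);	/* a zero-sized chunk would make no progress */
	return min(left, (unsigned long)chunk);
}
```

Without the cast, min(left, chunk) compares an unsigned long * with an unsigned int *, which is what the error above reports.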
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When the VMA being restored is not COW-ed we read pages from images
one-by-one which results in suboptimal pages.img access. Fix this
by reading as many pages from the image at once as possible within the
active pagemap and VMA.
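A rough sketch of the bound described above, assuming a fixed 4K page and a hypothetical helper name (neither is taken from the patch). The run of pages read at once ends at whichever comes first, the end of the VMA or the end of the active pagemap entry:

```c
#include <assert.h>

#define PAGE_SZ 4096UL	/* assumed page size, for illustration only */

/*
 * Number of pages that can be read in one go starting at va: the run
 * stops at whichever ends first, the VMA or the active pagemap entry.
 */
static unsigned long pages_in_one_read(unsigned long va,
				       unsigned long vma_end,
				       unsigned long pagemap_end)
{
	unsigned long end = vma_end < pagemap_end ? vma_end : pagemap_end;

	assert(va % PAGE_SZ == 0 && end % PAGE_SZ == 0);
	if (va >= end)
		return 0;
	return (end - va) / PAGE_SZ;
}
```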
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In order to restore seccomp filters, we need to have access to dynamically
allocated memory from the restorer blob, so we should unmap this memory
afterwards. In order to do this, we need to suspend seccomp earlier, right
after we attach to the tasks instead of just before we do the unmap of the
restorer blob itself.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This reverts commit 73cb87f9182bf46fceacde1e9023d8d5cdf99de6.
Two reasons: individual VMA-s may require different open flags
and ghost and link-remap files should be properly unlinked at
the end of open_path().
Need some more intelligent solution to this.
On restore we do a sequence of open+mmap+close steps. On real apps
there exist chains of private file mappings for the same file with
different pgoffs and/or flags/prots.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch(set) is inspired by a similar one that Andrey Vagin sent
some time earlier.
The major idea is to artificially fail criu dump or restore at
specific places and let zdtm tests check whether failed dump
or restore resulted in anything bad.
This particular patch introduces the ability to tell criu "fail
at X point". Each point is specified with an integer constant
and the next patches will add places in the code that check
whether a specific fail code is set and fail if so.
Two points are introduced -- early on dump, right after loading
the parasite and right after creation of the root task.
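The idea can be sketched roughly as follows. All names here (enum fault_point, fault_configure(), fault_requested()) and the numeric values are hypothetical illustrations, not the identifiers this patch introduces:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fault points; the numeric value is the "X" given to criu. */
enum fault_point {
	FP_NONE = 0,
	FP_AFTER_PARASITE = 1,	/* right after loading the parasite */
	FP_AFTER_ROOT_TASK = 2,	/* right after creation of the root task */
};

static int fault_code = FP_NONE;

/* Remember which point should fail; val is the user-supplied string. */
static void fault_configure(const char *val)
{
	fault_code = val ? atoi(val) : FP_NONE;
}

/* Checked at each instrumented place in the code. */
static int fault_requested(enum fault_point p)
{
	assert(p != FP_NONE);
	return fault_code == (int)p;
}
```

A call site would then do something like "if (fault_requested(FP_AFTER_PARASITE)) goto err;" and leave it to the test to check that the failed run left nothing bad behind.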
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
1. As pointed out by Coverity (CID 114629), mnt_ns_fd is closed,
but then the function calls try_clean_remaps(mnt_ns_fd)
which tries to close the file descriptor which is already closed.
To address this, let's use safe_close() which sets closed fd to -1.
As it also checks its argument, there's no need for explicit check
so let's remove "if" check before close().
2. As Pavel pointed out, "calling the whole try_clean_remaps()
is not required once we've passed the cleanup_mnt_ns() point".
This could be addressed by introducing yet another label, but
it's cleaner to just use a flag variable.
Note that since the second issue is being addressed, the first one
goes away, but let's keep the fix for it anyway, it might help in
the future.
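For reference, a helper with the behaviour described in point 1 could look roughly like the sketch below. The name close_and_reset() is made up for this example; it only illustrates the "check, close, set to -1" pattern:

```c
#include <assert.h>
#include <unistd.h>

/*
 * Close *fd if it is open and mark it as closed, so that a second
 * call on the same variable is a harmless no-op.
 */
static void close_and_reset(int *fd)
{
	assert(fd != NULL);
	if (*fd >= 0) {
		close(*fd);
		*fd = -1;
	}
}
```

Because the helper checks its argument itself, callers need no "if" around it, and a later cleanup path that sees -1 will not close an unrelated descriptor that happened to reuse the number.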
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If criu restore failed, criu should wait for all processes because they
hold files, namespaces and other stuff that caller might want to
have released (in our case it was ploop device).
Here we do this only for cases when processes are restored in a pid
namespace. We'd like to do the same for non-ns case, but there's
no simple way to wait for a bunch of unconnected processes.
Another good side effect is that "Restoring FAILED." will be printed
at the end of the log (currently, after we kill init, tasks still have
time to do something and write log messages).
Cc: Nikita Spiridonov <nspiridonov@odin.com>
Reported-by: Nikita Spiridonov <nspiridonov@odin.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This allows the user to perform actions before dumping or restoration
occurs.
Signed-off-by: Matthew Krafczyk <krafczyk.matthew@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
sys_sigaction() returns an error code
Reported-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Issue #18. When restore fails ghost files remain there. And
to remove them we have to know their list, paths to original
files (to construct the ghost name) and the namespace ghost
lives in.
For the latter we keep the restore task namespace at hands
till the final stage and setns into it to kill ghosts.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Info about the presence and paths of ghosts will be needed to
remove the ghosts themselves and thus is needed in criu.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So here it is. If root task dies on restore the roots yard
dir remains unrmdired :( Since we already know its name, we
can remove one from criu. By the time we get to this place
the sub mount namespace(s) are already dead and yard dir
is empty. But umounting should be done by tasks after
successful restore, so keep depopulation there.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Same thing as in previous patch -- we have too many generic
clean_ and fini_ prefixes over the code. And we need more (see
next patch), so let's specify what exactly we clean or fini.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are already two things we do in criu namespaces before
forking the init task (start unsd and keep netnsfd for back
reference). Next patches will introduce the 3rd action for
mount namespaces, so have a special pre-call for all this
stuff.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded. While
trivial applications successfully checkpoint and restore on AArch64
kernels with CONFIG_ARM64_64K_PAGES=y without this patch, replacing
the remaining use of the hard-coded value seems like the best way to
guard against failures that more complex process trees and future uses
may expose.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded. Since
vma_area_is_private() is used by both restorer blob code and non
restorer blob code, which must use different variables for recording
the task size, make task_size a function argument and modify the call
sites accordingly. This fixes the following error on AArch64 kernels
with CONFIG_ARM64_64K_PAGES=y.
pie: Error (pie/restorer.c:929): Can't restore 0x3ffb7e70000 mapping w>
pie: ith 0xfffffffffffffff7
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we want one CRIU binary to work across all AArch64 kernel
configurations, a single task size value cannot be hard coded.
This fixes the following error on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y.
pie: Error (pie/restorer.c:772): Unable to unmap (-): -1211695104
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently each task subtracts number of zombies from
task_entries->nr_threads without locks, so if two tasks do this
operation concurrently, the result may be unpredictable.
https://github.com/xemul/criu/issues/13
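To illustrate the race: a plain "nr_threads -= n" is a read-modify-write, so two tasks can both read the old value and one update is lost. The sketch below makes the subtraction atomic with C11 atomics. This is only one possible way to serialize the update and is an assumption of this example; the actual fix may take a lock instead, and the counter here is a stand-in for task_entries->nr_threads:

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the shared task_entries->nr_threads counter. */
static atomic_int nr_threads;

/* Atomically subtract the zombies and return the new counter value. */
static int drop_zombies(int nr_zombies)
{
	assert(nr_zombies >= 0);
	return atomic_fetch_sub(&nr_threads, nr_zombies) - nr_zombies;
}
```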
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The old and new address parameters passed to the mremap system
call must be page size aligned. On AArch64, the page size can
only be correctly determined at run time. This fixes the following
errors for CRIU on AArch64 kernels with CONFIG_ARM64_64K_PAGES=y.
call mremap(0x3ffb7d50000, 8192, 8192, MAYMOVE | FIXED, 0x2a000)
Error (rst-malloc.c:201): Can't mremap rst mem: Invalid argument
call mremap(0x3ffb7d90000, 8192, 8192, MAYMOVE | FIXED, 0x32000)
Error (rst-malloc.c:201): Can't mremap rst mem: Invalid argument
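In the log above, 0x2a000 and 0x32000 are multiples of 4K but not of 64K, hence the EINVAL. A minimal sketch of rounding to the page size queried at run time; the helper names are hypothetical and not the ones used in rst-malloc.c:

```c
#include <assert.h>
#include <unistd.h>

/* Page size as reported by the running kernel, not a build-time constant. */
static unsigned long runtime_page_size(void)
{
	long ps = sysconf(_SC_PAGESIZE);

	assert(ps > 0 && (ps & (ps - 1)) == 0);	/* must be a power of two */
	return (unsigned long)ps;
}

/* Round v up to the next multiple of the run-time page size. */
static unsigned long round_up_to_page(unsigned long v)
{
	unsigned long ps = runtime_page_size();

	return (v + ps - 1) & ~(ps - 1);
}
```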
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This fixes the following error for CRIU on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y.
Error (cr-restore.c:2828): Can't mmap section for restore code
This occurred because the address being requested (0x16000 in
one case) was not page aligned.
Also change the capitalization of the pie_size() macro to make it
clear that the value is not necessarily a build-time constant.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When a TASK_HELPER would exit just before a zombie, sometimes the signal
would get coalesced, and we would miss the zombie exit, causing us to block
forever waiting for the zombie to complete. Let's use an entirely different
strategy for waiting on zombies: explicitly wait on them with waitid, and
use WNOWAIT to prevent their data from actually being reaped.
v2: don't decrement nr_{tasks,threads} in the loop
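A small sketch of the waitid() + WNOWAIT approach, using hypothetical helper names. The child is waited for explicitly by pid, but it stays a zombie, so a later wait can still collect its status:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Test helper: fork a child that exits immediately with the given code. */
static pid_t spawn_exiting_child(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);
	return pid;
}

/*
 * Block until the given child has exited, without reaping it:
 * WNOWAIT leaves the zombie in place for a later wait call.
 */
static int wait_zombie_noreap(pid_t pid)
{
	siginfo_t info;

	assert(pid > 0);
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0)
		return -1;
	return info.si_pid == pid ? 0 : -1;
}
```

Since this waits on a specific pid instead of counting SIGCHLD deliveries, a coalesced signal can no longer cause a missed exit.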
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
8ffbe754bd9 moved the rst_mem_lock() call, but didn't move the
corresponding LSM allocations, so we do that here.
One unfortunate thing is that we have to split this into two steps: first
we have to read the creds to figure out exactly how much memory to
allocate for the lsm string. Since prepare_creds() wants to write directly
to the task_restore_args struct and that can't be allocated until after we
lock the restore memory, we break it up into two steps.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
After we got the total remappable rst memory size, we can no
longer allocate from it, otherwise the bootstrap area will not
be large enough.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's similar to previous patch with tcp mem -- no need to
realloc big arrays and then memcpy data between them. It's
enough just to walk timerfd objects at the very end.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In current scheme we grow an array with realloc()-s then
memcpy() the result into rst_mem. I propose to get rid
of realloc-s (we already have objects for the data we
need to keep) and memcpy-s (and put objects directly
into rst_mem at the end).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Calling rst_mem_alloc() in a loop with increasing size causes the
n^2 memory growth :) since _alloc is not _realloc.
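The arithmetic behind this, as a small sketch with made-up function names: if step i allocates a fresh area for i objects instead of extending the previous one, the total consumed is obj_size * (1 + 2 + ... + n) = obj_size * n * (n + 1) / 2, versus obj_size * n for a single allocation at the end.

```c
#include <assert.h>

/* Bytes consumed when every loop step allocates room for i objects anew. */
static unsigned long bytes_alloc_per_step(unsigned long n,
					  unsigned long obj_size)
{
	unsigned long total = 0, i;

	assert(obj_size > 0);
	for (i = 1; i <= n; i++)
		total += i * obj_size;
	return total;
}

/* Bytes consumed when the final size is allocated once. */
static unsigned long bytes_alloc_once(unsigned long n, unsigned long obj_size)
{
	return n * obj_size;
}
```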
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
linux/seccomp.h may not be available, and the seccomp mode might not be
listed in /proc/pid/status, so let's not assume those two things are
present.
v2: add a seccomp.h with all the constants we use from linux/seccomp.h
v3: don't do a compile time check for PTRACE_O_SUSPEND_SECCOMP, just let
ptrace return EINVAL for it; also add a checkskip to skip the
seccomp_strict test if PTRACE_O_SUSPEND_SECCOMP or linux/seccomp.h
aren't present.
v4: use criu check --feature instead of checkskip to check whether the
kernel supports seccomp_suspend
Reported-by: Mr. Jenkins
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since we don't support dumping per-thread creds, let's at least fail to
dump if the creds don't match.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Unfortunately, SECCOMP_MODE_FILTER is not currently exposed to userspace,
so we can't checkpoint that. In any case, this is what we need to do for
SECCOMP_MODE_STRICT, so let's do it.
This patch works by first disabling seccomp for any processes who are going
to have seccomp filters restored, then restoring the process (including the
seccomp filters), and finally resuming the seccomp filters before detaching
from the process.
v2 changes:
* update for kernel patch v2
* use protobuf enum for seccomp type
* don't parse /proc/pid/status twice
v3 changes:
* get rid of extra CR_STAGE_SECCOMP_SUSPEND stage
* only suspend seccomp in finalize_restore(), just before the unmap
* restore the (same) seccomp state in threads too; also add a note about
how this is slightly wrong, and that we should at least check for a
mismatch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Eric wants to restrict permissions for proc mounts in a non-root userns
in accordance with proc mounts in the root userns.
Author: Eric W. Biederman <ebiederm@xmission.com>
Date: Fri May 8 23:49:47 2015 -0500
mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
Ignore an existing mount if the locked readonly, nodev or atime
attributes are less permissive than the desired attributes
of the new mount.
...
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Instead of keeping around multiple fds that point to various places in
/proc, let's just use /proc and openat() things relative to it.
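A minimal sketch of the pattern, with a hypothetical helper name: one descriptor for /proc is kept open and everything else is opened relative to it with openat().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open /proc/<pid>/<name> relative to an already opened /proc fd.
 * Returns the new fd, or -1 on error.
 */
static int open_proc_rel(int proc_fd, int pid, const char *name)
{
	char path[64];

	assert(name != NULL);
	snprintf(path, sizeof(path), "%d/%s", pid, name);
	return openat(proc_fd, path, O_RDONLY);
}
```

For example, open_proc_rel(proc_fd, pid, "status") replaces a dedicated fd for /proc/<pid>.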
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a little tricky, since the threads are forked in the restorer blob, we
can't open their attr/current files to pass into the restorer blob. So, we pass
in an fd for /proc that the restorer blob can use to access the attr/current
files once they exist.
N.B. this is still incorrect in that it restores the same credentials for all
threads in the group; however, it matches the behavior of the current creds
restore code, which also restores the same creds for all threads in the group.
v2: use simple_sprintf() instead of pie_strcat()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Otherwise getting
| parasite-syscall.c: In function ‘parasite_infect_seized’:
| parasite-syscall.c:1222:5: error: ‘elf_relocs’ undeclared (first use in this function)
Simply wrap the @elf_relocs_apply with macros.
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
- Move relocs application into a separate file which gets
compiled as a regular C file in criu (pie/pie-relocs.[ch])
- Move types used by piegen into pie/piegen/uapi/types.h
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
At the moment both parasite and restorer do not have
any relocs because we support x86-64 only, but
this will be changed soon so do a call and apply
relocations.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of @gotpcrel relocations we need additional
space to carry pointers.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we have several arrays of objects that get remapped
into pie area and their number is also passed. Clean and shorten
the remapping code a bit and bring their naming to a common format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use SIG_SETMASK instead of SIG_BLOCKMASK here in case the parent had SIGCHLD
blocked. In this case if one of the criu threads has a problem, since the
SIGCHLD is blocked, the restore simply hangs.
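The difference can be shown with plain sigprocmask(): adding to the mask (the POSIX SIG_BLOCK operation) keeps whatever the parent had blocked, while SIG_SETMASK replaces the inherited mask entirely. The helper names below are hypothetical, and the empty mask is an assumption of this sketch rather than the mask criu actually installs:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Replace the inherited signal mask, so SIGCHLD cannot stay blocked. */
static int reset_sigmask(void)
{
	sigset_t empty;

	sigemptyset(&empty);
	return sigprocmask(SIG_SETMASK, &empty, NULL);
}

/* Query the current mask without changing it. */
static int sigchld_is_blocked(void)
{
	sigset_t cur;
	int rc = sigprocmask(SIG_SETMASK, NULL, &cur);

	assert(rc == 0);
	(void)rc;
	return sigismember(&cur, SIGCHLD);
}
```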
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>