Also define some constants for people who don't have them in their headers.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required to check the SIGSTOP signal, which can't be blocked.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So we keep it and don't close it inside the close_old_fds()
helper, but pass it into veth creation so the kernel
can fetch the net namespace of the veth peer.
v2 (by avagin@):
- don't forget to close opened descriptor
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: use a cached value to dump ipv6 interface addresses
call get_ipv6() from kerndat_init_rst too
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It returns EINTR, so we need to handle it.
$ bash test/zdtm.sh --restore-sibling ns/static/env00
...
futex(0x7fc20ec92010, FUTEX_WAIT, 1, {120, 0}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This commit adds basic support for dumping and restoring seccomp filters
via the new ptrace interface. There are two current known limitations with
this approach:
1. This approach doesn't support restoring tasks that first do a seccomp()
and then a setuid(); the test elaborates on this and I don't think it is
tough to do, but it is not done yet.
2. Filters are compared via memcmp(), so two tasks which have the same
parent task and install identical (via memory) filters will have those
filters considered to be the "same". Since we force all tasks to have
the same creds (including seccomp filters) right now, this isn't a
problem.
The approach used here is very similar to the cgroup approach: the actual
filters are stored in a seccomp.img, and each task has an id that points to
the part of the filter tree it needs to restore. This keeps us from dumping
the same filter multiple times, since filters are inherited on fork.
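The memcmp()-based comparison from limitation 2 can be sketched roughly as
follows. This is only an illustration, not the code from this commit: the
helper name filters_equal() is hypothetical, and it assumes filters are
held as struct sock_fprog from <linux/filter.h>.

```c
#include <assert.h>
#include <linux/filter.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: two filters are treated as the "same" when they
 * have the same length and byte-identical instructions.
 */
static bool filters_equal(const struct sock_fprog *a,
			  const struct sock_fprog *b)
{
	assert(a != NULL && b != NULL);

	if (a->len != b->len)
		return false;
	if (a->len == 0)
		return true;	/* two empty programs, nothing to compare */

	return memcmp(a->filter, b->filter,
		      a->len * sizeof(struct sock_filter)) == 0;
}
```

A comparison like this cannot tell apart two filters that were installed
separately but happen to contain the same instructions, which is exactly
the limitation described above.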
v2:
* remove unused seccomp_filters field from struct rst_info
* rework memory layout for passing filters to restorer blob
* add a sanity check when finding inherited filters
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: add comments and rename ns_created to ns_populated.
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
close_old_fds() knows nothing about more than one set of service file
descriptors, so it's better to call it before forking children as it was
before 9d60724eca71 ("restore: restore mntns before creating private vma-s")
The root task restores all processes and pins them with file descriptors,
then a task restores a mount namespace by opening the file descriptor of
the root task via /proc/pid/fd/X.
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to open a file to restore a file mapping and this file
can be from a current mntns.
v2: All namespaces are restored from the root task and then
other tasks call setns() to set a proper mntns.
v3: fix comments from Pavel
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Grabbed from the kernel. It is probably worth gathering
all bit manipulators here in the future.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Implementing c/r of bridges with slaves shouldn't be too hard (viz. the
comment), but this is all I need for right now.
v2: remove extra debug statement
v3: * remember to close fd in dump_bridge
* use "known" buffer length and snprintf for spath in dump_bridge
* change brace style
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When live migrating a container with a large number of processes
inside, a page-server-ed dump may be up to 10 times slower than
a local dump.
The delay is always introduced in open_page_server_xfer()
when criu negotiates the has_parent bit on the 2nd task. This
likely happens because of the Nagle algorithm taking place -- after
the write() of the OPEN2 command, the kernel delays sending this
command while waiting for more data.
v2:
Fix this by turning on the CORK option on memory transfer sockets
on the send side, and the NODELAY one on urgent data. The receive
side is always NODELAY-ed. According to Alexey Kuznetsov this
is the best mode ever for this type of transfer.
v3:
Push packets in pre-dump's check_parent_server_xfer too.
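The socket modes described above can be sketched with plain setsockopt()
calls. This is a minimal illustration rather than the actual criu code:
the helper names tcp_opt(), tcp_cork() and tcp_nodelay() are hypothetical,
and TCP_CORK is Linux-specific.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set one boolean TCP-level option on a socket. */
static int tcp_opt(int sk, int opt, int val)
{
	assert(sk >= 0);
	return setsockopt(sk, IPPROTO_TCP, opt, &val, sizeof(val));
}

/* Send side: hold back partial frames while the cork is set. */
static int tcp_cork(int sk, int on)
{
	return tcp_opt(sk, TCP_CORK, on);
}

/* Disable Nagle's algorithm so small writes are not delayed. */
static int tcp_nodelay(int sk, int on)
{
	return tcp_opt(sk, TCP_NODELAY, on);
}
```

In this sketch the sender would call tcp_cork(sk, 1) once after connecting
and tcp_nodelay(sk, 1) where a command must reach the peer right away.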
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@odin.com>
Pass function name into a helper instead of pointer
which doesn't provide much useful info.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reading stops after an EOF or a specified character.
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Contrary to a popular opinion, there is no need to check
an argument for being non-NULL before calling free().
From the free(3) man page:
> If ptr is NULL, no operation is performed.
Let's change xfree macro to be a synonym for free().
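A minimal sketch of what the simplified macro amounts to. The helper
release_buf() is hypothetical and only demonstrates that no NULL check is
needed before the call.

```c
#include <assert.h>
#include <stdlib.h>

/* xfree as a plain synonym for free(); no NULL check is required. */
#define xfree(p) free(p)

/* Hypothetical helper: release a buffer and clear the caller's pointer. */
static void release_buf(char **buf)
{
	assert(buf != NULL);
	xfree(*buf);	/* fine even when *buf is NULL */
	*buf = NULL;
}
```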
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch(set) is inspired by a similar one from Andrey Vagin sent
some time earlier.
The major idea is to artificially fail criu dump or restore at
specific places and let zdtm tests check whether failed dump
or restore resulted in anything bad.
This particular patch introduces the ability to tell criu "fail
at X point". Each point is specified with an integer constant,
and the next patches will add places in the code that check
whether a specific fail code is set and fail if it is.
Two points are introduced -- early on dump, right after loading
the parasite and right after creation of the root task.
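The mechanism can be sketched like this. All names and values below are
hypothetical and chosen for illustration; they are not the identifiers
introduced by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fault points; one integer constant per point. */
enum fault_point {
	FAULT_NONE = 0,
	FAULT_DUMP_EARLY,
	FAULT_RESTORE_ROOT_CREATED,
	FAULT_MAX,
};

/* The single point requested by the user; FAULT_NONE when disabled. */
static enum fault_point fault_requested = FAULT_NONE;

static void fault_set(enum fault_point f)
{
	assert(f < FAULT_MAX);
	fault_requested = f;
}

/* A code path calls this and bails out with an error when it is true. */
static bool fault_injected(enum fault_point f)
{
	return f != FAULT_NONE && fault_requested == f;
}
```

A test would request one point, run the dump or restore, and then check
that the expected failure left nothing bad behind.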
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Kernels prior to 4.3 export the FS_EVENT_ON_CHILD
bit via the procfs fdinfo interface. This bit is internal
to the kernel and should not be passed to the inotify_add_watch
call. Thus simply filter it out when it is obtained from old
images, for backward compatibility.
More details here https://lkml.org/lkml/2015/9/21/680
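The filtering amounts to masking one bit out of the stored value. In the
sketch below the helper name inotify_user_mask() is hypothetical, and the
constant is assumed to match the kernel's internal fsnotify definition.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel-internal bit; value assumed from the kernel's fsnotify header. */
#ifndef FS_EVENT_ON_CHILD
#define FS_EVENT_ON_CHILD 0x08000000
#endif

/* Hypothetical helper: drop the internal bit from a mask read from an image. */
static uint32_t inotify_user_mask(uint32_t mask)
{
	return mask & ~(uint32_t)FS_EVENT_ON_CHILD;
}
```

Masks that never had the bit set pass through unchanged, so the filter is
harmless for images taken on newer kernels.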
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This allows the user to perform actions before dumping or restoration
occurs.
Signed-off-by: Matthew Krafczyk <krafczyk.matthew@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In commit c2271198, Laurent Dufour kindly reunified the VDSO code
that had become duplicated between architectures. Unfortunately
this introduced a regression in AArch64 where apparently due to
the scope of vdso_symbols array of pointers to characters changing
from local to global, load-time relocations became necessary.
The following thread on the GCC mailing list discusses why
load-time relocations can be necessary when pointers are used,
although it doesn't mention the potential for locally scoped
arrays to be handled differently:
https://gcc.gnu.org/ml/gcc/2004-05/msg01016.html
Because the alternatives, such as porting piegen to AArch64, are
far more involved, simply revert the change in scope.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In the recent VDSO code reunification, some types were changed but
a pair of necessary corresponding changes was omitted. Fix that so
the AArch64 build succeeds without type-related
warnings-turned-errors. Also move the definition to the
AArch64-specific header since it's not currently being used by any
other architectures.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When in a userns, tasks can't write to certain sysctl files:
(00.009653) 1: Error (sysctl.c:142): Can't open sysctl kernel/hostname: Permission denied
See inline comments for details on affected namespaces.
Mostly for my own education in what is required to port something to be
userns restorable, I ported the sysctl stuff. A potential concern for this
patch is that copying structures with pointers around is kind of gory. I
did it ad-hoc here, but it may be worth inventing some mechanisms to make
it easier, although I'm not sure what exactly that would look like
(potentially re-using some of the protobuf bits; I'll investigate this more
if it looks helpful when doing the cgroup user namespaces port?).
Another issue is that there is not a great way to return non-fd stuff in
memory right now from userns_call; one of the little hacks in this code
would be "simplified" if we invented a way to do this.
v2: coalesce the individual struct sysctl_req requests into one big
sysctl_userns_req that is in a contiguous region of memory so that we
can pass it via userns_call. Hopefully nobody finds my little ascii
diagram too offensive :)
v3: use the fork/setns trick to change the sysctl values in the right ns for
IPC/UTS nses; see inline comment for details
v4: only use sysctl_userns_req when actually doing a userns_call.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"ip route dump" dumps only ipv4 routes.
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: use struct irmap directly in irmap_path_opt
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Issue #18. When restore fails, ghost files remain. To remove
them we have to know their list, the paths to the original
files (to construct the ghost name) and the namespace each
ghost lives in.
For the latter we keep the restore task namespace at hand
until the final stage and setns into it to kill the ghosts.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Info about ghosts' presence and paths will be needed to
remove the ghosts themselves and is thus needed in criu.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
First -- avoid two memory copies by printing ns root directly, and
second -- remove extra argument from create_ghost, the mnt_id value
we need there can be found on the ghost_file object.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So here it is. If root task dies on restore the roots yard
dir remains unrmdired :( Since we already know its name, we
can remove it from criu. By the time we get to this place
the sub mount namespace(s) are already dead and the yard dir
is empty. But umounting should be done by tasks after a
successful restore, so keep depopulation there.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Same thing as in previous patch -- we have too many generic
clean_ and fini_ prefixes over the code. And we need more (see
next patch), so let's specify what exactly we clean or fini.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case of root task restore failure we'll have to remove the
roots yard dir from criu, so we have to create it from
criu to at least have the dir name.
It's OK to do it in criu, since the yard is created in
the opts.root which is the same for any mnt ns we deal
with on restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are already two things we do in criu namespaces before
forking the init task (start unsd and keep netnsfd for back
reference). Next patches will introduce the 3rd action for
mount namespaces, so have a special pre-call for all this
stuff.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Actually make use of the ns->type field and remove all getpid()'s
and other strange/inconsistent checks.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We (may) have 3 types of namespace objects in criu -- criu's one,
root task's one and others. All of them sometimes make sense and
we differentiate them in a weird way -- by checking the ns->pid
field against getpid() or by comparing with root_item's.
The proposal is to mark ns_id objects explicitly with type field.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this in the next patch to correctly write sysctls.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll use this size in the next patch to avoid having to do some dynamic
allocation.
v2: call it MAX_UNSFD_MSG_SIZE instead
v3: fix all uses of MAX_MSG_SIZE :)
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They both can contain the MS_RDONLY flag. And in one case it will be a
read-only bind-mount and in another case it will be a read-only
super-block.
v2: set mnt and sb for one call of mount() when it's possible
v3: restore a comment which was deleted by mistake
v4: Fix the sentence about restoring mnt flags
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This option allows users to specify their own irmap paths to scan in the event
that they don't have a path in one of the hard coded hints.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Some controllers can be disabled in kernel options. In this case they
are shown in /proc/cgroups, but they cannot be mounted.
All enabled controllers can be collected from /proc/self/cgroup.
https://github.com/xemul/criu/issues/28
v2: ',' is used to separate controllers
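For illustration, one line of /proc/self/cgroup ("id:ctl1,ctl2:path") can
be split into controller names as below. This is only a sketch under that
assumed line format; the helper parse_cgroup_line() is hypothetical and is
not the parser used in criu.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: split the controller field of one line into
 * names[]. The line is modified in place. Returns the number of
 * controllers stored, or -1 if the line is malformed.
 */
static int parse_cgroup_line(char *line, char **names, int max)
{
	char *ctls, *end;
	int n = 0;

	assert(line != NULL && names != NULL);

	ctls = strchr(line, ':');
	if (!ctls)
		return -1;
	ctls++;

	end = strchr(ctls, ':');
	if (!end)
		return -1;
	*end = '\0';	/* cut off the cgroup path */

	while (*ctls && n < max) {
		char *comma = strchr(ctls, ',');

		if (comma)
			*comma = '\0';
		names[n++] = ctls;
		if (!comma)
			break;
		ctls = comma + 1;
	}

	return n;
}
```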
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>