mirror of https://github.com/checkpoint-restore/criu synced 2025-08-26 11:57:52 +00:00

1696 Commits

Author SHA1 Message Date
Tycho Andersen
e6a3aef43e remap: don't allocate dead pids in wrong context
Closes #87

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
CC: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-23 11:47:29 +03:00
Tycho Andersen
cc9587ffc5 seccomp: is optional when parsing /proc/pid/status
Also define some constants for people who don't have them in their headers.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-23 11:44:50 +03:00
Andrew Vagin
028998c588 proc_parse: parse pending signals
It's required to check the SIGSTOP signal, which can't be blocked.

Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-20 21:13:31 +03:00
Cyrill Gorcunov
7de345d6b7 net: Move node's net fd reference into service fd
So we keep it and don't close it inside the close_old_fds()
helper, but pass it into veth creation so the kernel
can fetch the net namespace of the veth peer.

v2 (by avagin@):
 - don't forget to close opened descriptor

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-19 16:46:36 +03:00
Andrew Vagin
1e8a0594db net: dump iptables for ipv6 (v2)
v2: don't dump iptables if ipv6 isn't supported
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-19 15:19:01 +03:00
Andrew Vagin
1648db970c kerndat: check whether ipv6 is supported or not (v2)
v2: use a cached value to dump ipv6 interface addresses
    call get_ipv6() from kerndat_init_rst too

Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-19 15:18:08 +03:00
Andrew Vagin
a2780c6131 lock: futex() with timeout isn't restarted after signals (v2)
It returns EINTR, so we need to handle it.

$ bash test/zdtm.sh --restore-sibling ns/static/env00
...
futex(0x7fc20ec92010, FUTEX_WAIT, 1, {120, 0}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-19 15:15:39 +03:00
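One way to handle the EINTR case described in the futex commit above is to retry the wait. The following is a minimal sketch, not the code from the commit: `sys_futex` and `futex_wait_retry` are hypothetical names, and the relative timeout simply starts over on every retry, so the total wait can exceed the requested timeout.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* glibc exports no futex() wrapper, so go through syscall(2). */
static long sys_futex(uint32_t *addr, int op, uint32_t val,
		      const struct timespec *timeout)
{
	return syscall(SYS_futex, addr, op, val, timeout, NULL, 0);
}

/*
 * Hypothetical helper: sleep while *addr == val.
 * Returns 0 when woken or when *addr no longer equals val,
 * -1 with errno set on any other failure (e.g. ETIMEDOUT).
 * A wait interrupted by a signal (EINTR) is simply retried.
 */
static int futex_wait_retry(uint32_t *addr, uint32_t val,
			    const struct timespec *timeout)
{
	for (;;) {
		long ret = sys_futex(addr, FUTEX_WAIT, val, timeout);

		if (ret == 0)
			return 0;
		if (errno == EAGAIN)	/* value changed before we slept */
			return 0;
		if (errno == EINTR)	/* interrupted by a signal: retry */
			continue;
		return -1;
	}
}
```

A caller that needs a hard deadline would have to recompute the remaining time before each retry; the sketch leaves that out for brevity.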
Andrew Vagin
4c00ac2908 lock: print a message if a futex is locked for more than 120 seconds
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-17 10:52:22 +03:00
Tycho Andersen
221af18ea0 seccomp: add support for SECCOMP_MODE_FILTER
This commit adds basic support for dumping and restoring seccomp filters
via the new ptrace interface. There are two current known limitations with
this approach:

1. This approach doesn't support restoring tasks who first do a seccomp()
   and then a setuid(); the test elaborates on this and I don't think it is
   tough to do, but it is not done yet.

2. Filters are compared via memcmp(), so two tasks which have the same
   parent task and install identical (via memory) filters will have those
   filters considered to be the "same". Since we force all tasks to have
   the same creds (including seccomp filters) right now, this isn't a
   problem.

The approach used here is very similar to the cgroup approach: the actual
filters are stored in a seccomp.img, and each task has an id that points to
the part of the filter tree it needs to restore. This keeps us from dumping
the same filter multiple times, since filters are inherited on fork.

v2:
 * remove unused seccomp_filters field from struct rst_info
 * rework memory layout for passing filters to restorer blob
 * add a sanity check when finding inherited filters

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-17 10:51:20 +03:00
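The memcmp()-based comparison mentioned in the seccomp commit above can be illustrated roughly as follows. This is a sketch under assumptions: `struct bpf_insn_sketch` mirrors the layout of the kernel's `struct sock_filter` but is redeclared here so the example needs no kernel headers, and `struct filter_sketch` and `filters_equal` are hypothetical names, not CRIU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct sock_filter (8 bytes, no padding). */
struct bpf_insn_sketch {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* Hypothetical container for one filter program. */
struct filter_sketch {
	size_t len;				/* number of instructions */
	const struct bpf_insn_sketch *insns;
};

/* Two filters count as the "same" if their programs are bytewise equal. */
static bool filters_equal(const struct filter_sketch *a,
			  const struct filter_sketch *b)
{
	if (a->len != b->len)
		return false;
	return memcmp(a->insns, b->insns,
		      a->len * sizeof(*a->insns)) == 0;
}
```

As the commit notes, bytewise equality cannot tell an inherited filter from an identical one installed independently, which is why this only works while all tasks are forced to share the same creds.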
Andrew Vagin
b78af1923b mount: wait until mntns is created to get its root (v2)
v2: add comments and rename ns_created to ns_populated.

Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-17 10:46:00 +03:00
Andrew Vagin
7017181849 mount: don't inherit mount namespace descriptors to each process
close_old_fds() knows nothing about more than one set of service file
descriptors, so it's better to call it before forking children, as it was
before 9d60724eca71 ("restore: restore mntns before creating private vma-s")

The root task restores all processes and pins them with file descriptors,
then a task restores a mount namespace by opening the file descriptor of
the root task via /proc/pid/fd/X.

Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-17 10:45:09 +03:00
Andrew Vagin
9d60724eca restore: restore mntns before creating private vma-s (v3)
We need to open a file to restore a file mapping and this file
can be from a current mntns.

v2: All namespaces are restored from the root task and then
other tasks call setns() to set a proper mntns.

v3: fix comments from Pavel
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-14 09:53:47 +03:00
Pavel Emelyanov
dc00fea333 net: Don't print error in rule save
This thing is new and can be absent in ip tool, which is OK
and is handled by net.c code itself.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 16:31:21 +03:00
Pavel Emelyanov
18d9170858 util: Add flags to cr_system
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 16:31:19 +03:00
Cyrill Gorcunov
ee2409ec37 compiler: Grab min_t, max_t from the kernel
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 14:57:00 +03:00
Cyrill Gorcunov
ba475b8dcf bitmap -- Add few helpers for bits manipulations
Grabbed from the kernel. It is probably worth gathering
all bit manipulators here in the future.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 11:15:02 +03:00
Pavel Emelyanov
780d699401 page-read: Teach page-read to read multiple pages at once
This is preparatory patch, the problem to solve is described in
the next one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 11:14:43 +03:00
Tycho Andersen
8a95be0679 net: allow c/r of empty bridges in the container
Implementing c/r of bridges with slaves shouldn't be too hard (viz. the
comment), but this is all I need for right now.

v2: remove extra debug statement
v3: * remember to close fd in dump_bridge
    * use "known" buffer length and snprintf for spath in dump_bridge
    * change brace style

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-12 10:31:58 +03:00
Pavel Emelyanov
a20ed3c6f0 page-server: Fine grained corking control (v3)
When live migrating a container with a large number of processes
inside, a page-server-ed dump may take up to 10 times longer
than a local dump.

The delay is always introduced in open_page_server_xfer()
when criu negotiates the has_parent bit on the 2nd task. This
likely happens because of the Nagle algorithm taking place -- after
the write() of the OPEN2 command, the kernel delays sending this
command while waiting for more data.

v2:
Fix this by turning on CORK option on memory transfer sockets
on send side, and NODELAY one once on urgent data. Receive
side is always NODELAY-ed. According to Alexey Kuznetsov this
is the best mode ever for such type of transfers.

v3:
Push packets in pre-dump's check_parent_server_xfer too.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@odin.com>
2015-11-10 16:00:25 +03:00
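The corking scheme described in the page-server commit above relies on two standard Linux TCP socket options, TCP_CORK and TCP_NODELAY. A minimal sketch of the helpers involved might look like this; the function names are hypothetical and the exact call sites in criu are not shown here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set a boolean TCP-level option on socket sk. */
static int tcp_set_opt(int sk, int opt, int on)
{
	return setsockopt(sk, IPPROTO_TCP, opt, &on, sizeof(on));
}

/*
 * Cork (on = 1): the kernel holds back partial frames so that bulk
 * page data goes out in full-sized segments.
 * Uncork (on = 0): queued data is flushed.
 */
static int tcp_cork(int sk, int on)
{
	return tcp_set_opt(sk, TCP_CORK, on);
}

/*
 * Disable Nagle's algorithm (on = 1) so that small, urgent messages
 * are not delayed while waiting for more data.
 */
static int tcp_nodelay(int sk, int on)
{
	return tcp_set_opt(sk, TCP_NODELAY, on);
}
```

Under the scheme in the commit, the sending side would call `tcp_cork(sk, 1)` on memory transfer sockets and `tcp_nodelay(sk, 1)` when an urgent command has to go out, while the receiving side keeps NODELAY on permanently.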
Pavel Emelyanov
d6d06c9dfc Open proc links with O_PATH
These three are like map_files one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2015-11-10 15:58:36 +03:00
Cyrill Gorcunov
049a7c828a userns: Wrap call with a macro for readability
Pass function name into a helper instead of pointer
which doesn't provide much useful info.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-11-05 15:29:04 +03:00
Kirill Tkhai
c9afd17ad6 net: Add ip rule save/restore
Add support for save and restore of ip rules. It uses new
functionality of iproute which is already in iproute git:

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=2f4e171f7df22107b38fddcffa56c1ecb5e73359

v2: Use xstrdup() instead of strdup().
v3: Use open/close instead of helper.
v4: Return -1 on empty dump.

Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-27 22:56:33 +03:00
Andrew Vagin
1d8fcb6b94 bfd: add breadchr
Reading stops after an EOF or a specified character.

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-27 22:51:09 +03:00
Cyrill Gorcunov
7a99e699ce mnt: Export __open_mountpoint
We are going to need it for inotify handle testing.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-21 15:08:03 +03:00
Kir Kolyshkin
5940e3d14c xfree(): simplify
Contrary to a popular opinion, there is no need to check
an argument for being non-NULL before calling free().

From free(3) man page:

> If ptr is NULL, no operation is performed.

Let's change xfree macro to be a synonym for free().

Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-21 14:58:39 +03:00
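The xfree() simplification above boils down to the difference between the two macros below. This is an illustrative sketch only: `xfree_guarded` is a hypothetical stand-in for a NULL-checking variant, not the exact macro that the commit removed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "before": a redundant NULL check in front of free(). */
#define xfree_guarded(p) do { if (p) free(p); } while (0)

/* "After": free(NULL) is defined to do nothing, so no check is needed. */
#define xfree(p) free(p)
```

Both forms behave identically for NULL and non-NULL pointers; the second one just drops a branch that the C library already performs.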
Pavel Emelyanov
68baf8e77d criu: Fault injection core
This patch(set) is inspired by a similar one from Andrey Vagin sent
some time earlier.

The major idea is to artificially fail criu dump or restore at
specific places and let zdtm tests check whether failed dump
or restore resulted in anything bad.

This particular patch introduces the ability to tell criu "fail
at X point". Each point is specified with an integer constant,
and the next patches will add places in the code that check
whether a specific fail code is set and fail if so.

Two points are introduced -- early on dump, right after loading
the parasite and right after creation of the root task.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-19 12:42:29 +03:00
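The "fail at X point" idea from the fault injection commit above can be sketched as follows. All identifiers here (`enum fault_point`, `fault_injection_set`, `fault_injected`) are hypothetical and chosen for illustration; the commit's real names and the way the value reaches criu are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical fault points, one integer constant per place. */
enum fault_point {
	FAULT_NONE = 0,
	FAULT_DUMP_EARLY,	/* early on dump, after loading the parasite */
	FAULT_RESTORE_ROOT,	/* right after creation of the root task */
	FAULT_MAX,
};

static enum fault_point active_fault = FAULT_NONE;

/* Parse the requested fault point from a decimal string; 0 on success. */
static int fault_injection_set(const char *str)
{
	char *end;
	long val;

	if (!str || !*str)
		return -1;
	val = strtol(str, &end, 10);
	if (*end != '\0' || val <= FAULT_NONE || val >= FAULT_MAX)
		return -1;
	active_fault = (enum fault_point)val;
	return 0;
}

/* Checked at each instrumented place in the code. */
static bool fault_injected(enum fault_point point)
{
	return active_fault == point;
}
```

An instrumented place would then read something like `if (fault_injected(FAULT_DUMP_EARLY)) return -1;`, letting a test verify that the failed dump or restore leaves nothing bad behind.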
Cyrill Gorcunov
61859d1176 fsnotify: Filter out internal inotify bits when restoring marks
Kernels prior to 4.3 export the FS_EVENT_ON_CHILD
bit via the procfs fdinfo interface. This bit is internal
to the kernel and should not be passed in the inotify_add_watch
call. Thus simply filter it out when it is obtained from old
images, for backward compatibility.

More details here https://lkml.org/lkml/2015/9/21/680

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-14 15:51:55 +03:00
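Filtering the internal bit, as the fsnotify commit above describes, amounts to a single mask operation. In this sketch the constant's value is an assumption taken from the kernel's fsnotify headers, and `sanitize_inotify_mask` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the kernel-internal FS_EVENT_ON_CHILD bit. */
#define KERN_FS_EVENT_ON_CHILD 0x08000000u

/* Drop the kernel-internal bit before passing the mask to inotify_add_watch(). */
static uint32_t sanitize_inotify_mask(uint32_t mask)
{
	return mask & ~KERN_FS_EVENT_ON_CHILD;
}
```

Masks read from newer kernels pass through unchanged, so the same code path can serve both old and new images.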
Matthew Krafczyk
29c08d8672 Add pre-dump and pre-restore action scripts
This allows the user to perform actions before dumping or restoration
occurs.

Signed-off-by: Matthew Krafczyk <krafczyk.matthew@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-09 18:23:41 +03:00
Christopher Covington
871da9a111 pie: Give VDSO symbol table local scope
In commit c2271198, Laurent Dufour kindly reunified the VDSO code
that had become duplicated between architectures. Unfortunately
this introduced a regression in AArch64 where apparently due to
the scope of vdso_symbols array of pointers to characters changing
from local to global, load-time relocations became necessary.

The following thread on the GCC mailing list discusses why
load-time relocations can be necessary when pointers are used,
although it doesn't mention the potential for locally scoped
arrays to be handled differently:
https://gcc.gnu.org/ml/gcc/2004-05/msg01016.html

Because the alternatives, such as porting piegen to AArch64, are
far more involved, simply revert the change in scope.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:21:16 +03:00
Christopher Covington
627f9a9e5f aarch64: Fix write_intraprocedure_branch types
In the recent VDSO code reunification, some types were changed but
a pair of necessary corresponding changes was omitted. Fix that so
the AArch64 build succeeds without type-related
warnings-turned-errors. Also move the definition to the
AArch64-specific header since it's not currently being used by any
other architectures.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:20:01 +03:00
Tycho Andersen
f79f4546cf sysctl: move sysctl calls to usernsd
When in a userns, tasks can't write to certain sysctl files:

(00.009653)      1: Error (sysctl.c:142): Can't open sysctl kernel/hostname: Permission denied

See inline comments for details on affected namespaces.

Mostly for my own education in what is required to port something to be
userns restorable, I ported the sysctl stuff. A potential concern for this
patch is that copying structures with pointers around is kind of gory. I
did it ad-hoc here, but it may be worth inventing some mechanisms to make
it easier, although I'm not sure what exactly that would look like
(potentially re-using some of the protobuf bits; I'll investigate this more
if it looks helpful when doing the cgroup user namespaces port?).

Another issue is that there is not a great way to return non-fd stuff in
memory right now from userns_call; one of the little hacks in this code
would be "simplified" if we invented a way to do this.

v2: coalesce the individual struct sysctl_req requests into one big
    sysctl_userns_req that is in a contiguous region of memory so that we
    can pass it via userns_call. Hopefully nobody finds my little ascii
    diagram too offensive :)
v3: use the fork/setns trick to change the sysctl values in the right ns for
    IPC/UTS nses; see inline comment for details
v4: only use sysctl_userns_req when actually doing a userns_call.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:16:14 +03:00
Andrew Vagin
a973e6fcb3 net: dump ipv6 routes
"ip route dump" dumps only ipv4 routes.

Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-10-05 13:11:31 +03:00
Tycho Andersen
97cb181cbc irmap: don't leak irmap objects in --irmap-scan-path
v2: use struct irmap directly in irmap_path_opt

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:02:51 +03:00
Pavel Emelyanov
efa7dcf7c2 ghost: Remove ghost files if restore fails
Issue #18. When restore fails ghost files remain there. And
to remove them we have to know their list, paths to original
files (to construct the ghost name) and the namespace ghost
lives in.

For the latter we keep the restore task namespace at hand
till the final stage and setns into it to kill ghosts.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:37 +03:00
Pavel Emelyanov
a7c9f3011d mnt: Read mount images early
Mappings from mount id to namespace will be required to
remove ghosts on restore failure.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:36 +03:00
Pavel Emelyanov
b0e23c3d4f files: Collect ghosts and regfiles early
Info about the ghosts' presence and paths will be needed to
remove the ghosts themselves and thus is needed in criu.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 22:00:35 +03:00
Pavel Emelyanov
152222a6b7 remap: Sanitize ghost file path printing
First -- avoid two memory copies by printing ns root directly, and
second -- remove extra argument from create_ghost, the mnt_id value
we need there can be found on the ghost_file object.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:59:45 +03:00
Pavel Emelyanov
6cf77f6726 remap: Rename fields for easier grep
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:58:28 +03:00
Pavel Emelyanov
7ca6cc1eb2 mnt: Clean roots yard from criu process
So here it is. If root task dies on restore the roots yard
dir remains unrmdired :( Since we already know its name, we
can remove one from criu. By the time we get to this place
the sub mount namespace(s) are already dead and yard dir
is empty. But umounting should be done by tasks after
successful restore, so keep depopulation there.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:57:35 +03:00
Pavel Emelyanov
3e7c92ed02 mnt: Renames around roots yard
Same thing as in previous patch -- we have too many generic
clean_ and fini_ prefixes over the code. And we need more (see
next patch), so let's specify what exactly we clean or fini.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:57:21 +03:00
Pavel Emelyanov
c5c65fe17a mnt: Create roots in criu context
In case of root task restore failure we'll have to remove the
roots yard dir from criu, so we have to create one by
criu to at least have the dir name.

It's OK to do it in criu, since the yard is created in
the opts.root which is the same for any mnt ns we deal
with on restore.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:56:51 +03:00
Pavel Emelyanov
e3f5ba3c37 ns: Prepare namespaces before tasks
There are already two things we do in criu namespaces before
forking the init task (start unsd and keep netnsfd for back
reference). Next patches will introduce the 3rd action for
mount namespaces, so have a special pre-call for all this
stuff.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 21:56:26 +03:00
Pavel Emelyanov
9b3189fed1 util: Add make_yard helper
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-28 11:32:18 +03:00
Pavel Emelyanov
9353051ba7 ns: Check ns type with type field
Actually make use of the ns->type field and remove all getpid()'s
and other strange/inconsistent checks.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:15:28 +03:00
Pavel Emelyanov
22b7256612 ns: Introduce ns type
We (may) have 3 types of namespace objects in criu -- criu's one,
root task's one and others. All of them sometimes make sense and
we differentiate them in a weird way -- by checking the ns->pid
field against getpid() or by comparing with root_item's.

The proposal is to mark ns_id objects explicitly with type field.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:14:07 +03:00
Tycho Andersen
85ebf0a83b usernsd: also pass pid of process that made the req
We'll use this in the next patch to correctly write sysctls.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 12:01:01 +03:00
Tycho Andersen
72ff44d0dc usernsd: move MAX_MSG_SIZE to namespaces.h
We'll use this size in the next patch to avoid having to do some dynamic
allocation.

v2: call it MAX_UNSFD_MSG_SIZE instead
v3: fix all uses of MAX_MSG_SIZE :)

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:57:40 +03:00
Andrey Vagin
1174a2ad0f mount: handle mnt_flags and sb_flags separately (v4)
They both can contain the MS_RDONLY flag. And in one case it will be
a read-only bind-mount and in another case it will be a read-only
super-block.

v2: set mnt and sb for one call of mount() when it's possible
v3: return a comment which was deleted by mistake
v4: Fix the sentence about restoring mnt flags
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:55:17 +03:00
Tycho Andersen
4f2e4ab3be irmap: add --irmap-scan-path option
This option allows users to specify their own irmap paths to scan in the event
that they don't have a path in one of the hard coded hints.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-21 11:46:12 +03:00
Andrey Vagin
d3be641acd cgroups: get controllers from /proc/self/cgroup (v2)
Some controllers can be disabled in kernel options. In this case they
are shown in /proc/cgroups, but they could not be mounted.

All enabled controllers can be collected from /proc/self/cgroup.

https://github.com/xemul/criu/issues/28

v2: ',' is used to separate controllers

Cc: Tycho Andersen <tycho.andersen@canonical.com>
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2015-09-16 15:46:10 +03:00
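Each line of /proc/self/cgroup has the form `hierarchy-ID:controller-list:cgroup-path`, with the controllers separated by ',' as the v2 note in the commit above points out. The following sketch splits one such line; `split_controllers` is a hypothetical helper, not the parser from the commit.

```c
#include <assert.h>
#include <string.h>

/*
 * Split the controller field of one /proc/self/cgroup line in place.
 * Pointers to the controller names are stored in out[]; the return
 * value is the number of names found, or -1 if the line is malformed
 * or holds more than max controllers.
 */
static int split_controllers(char *line, char **out, int max)
{
	char *list, *end;
	int n = 0;

	list = strchr(line, ':');		/* skip the hierarchy ID */
	if (!list)
		return -1;
	list++;
	end = strchr(list, ':');		/* cut off the cgroup path */
	if (!end)
		return -1;
	*end = '\0';

	while (*list) {
		char *comma = strchr(list, ',');

		if (n == max)
			return -1;
		out[n++] = list;
		if (!comma)
			break;
		*comma = '\0';
		list = comma + 1;
	}
	return n;
}
```

For the line `4:cpu,cpuacct:/user.slice` this yields the two names `cpu` and `cpuacct`; a line with an empty controller list yields zero names.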