This commit is in preparation for the (hopefully last :) restore special cpuset
patch.
Previously, we installed the cgroup service fd after calling
prepare_cgroup_dirs, which meant that we had to carry around the temporary
directory name in order to put things in the right place. The
restore_cgroup_prop function uses the cg service fd instead of carrying around
the full path. This means that we can't use restore_cgroup_prop without first
sanitizing the path. Instead, we install the service fd before calling
prepare_cgroup_dirs, and all the code just references that instead of carrying
around the temporary path.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping tasks we do a lot of open_proc()-s and to
speed this up the /proc/pid directory is opened first
and the fd is kept cached. So next open_proc()-s do just
openat(cached_fd, name).
The thing is that we sometimes call open_proc(PROC_SELF)
in between and the proc helpers cache /proc/self too. As
a result we have a bunch of
open(/proc/pid)
close()
open(/proc/self)
close()
see-saw-s in the middle of dumping tasks.
To fix this we may cache the /proc/self separately from
the /proc/pid descriptor. This eliminates quite a lot
of pointless open-s and close-s.
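
A minimal sketch of the idea, not the actual patch: the per-pid directory
fd and the /proc/self directory fd live in separate slots, so asking for
/proc/self does not evict the cached /proc/pid. All names here
(PROC_SELF_SKETCH, proc_dir_fd, open_proc_sketch) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PROC_SELF_SKETCH 0 /* hypothetical marker meaning "/proc/self" */

static int cached_pid = -1;     /* pid whose directory is cached */
static int cached_pid_fd = -1;  /* fd for /proc/<pid> */
static int cached_self_fd = -1; /* fd for /proc/self, kept separately */

/* Return a cached directory fd for /proc/<pid> or /proc/self. */
static int proc_dir_fd(int pid)
{
	char path[64];

	if (pid == PROC_SELF_SKETCH) {
		if (cached_self_fd < 0)
			cached_self_fd = open("/proc/self", O_RDONLY | O_DIRECTORY);
		return cached_self_fd;
	}

	if (pid != cached_pid) {
		/* Switching to another task: drop only the per-pid cache. */
		if (cached_pid_fd >= 0)
			close(cached_pid_fd);
		snprintf(path, sizeof(path), "/proc/%d", pid);
		cached_pid_fd = open(path, O_RDONLY | O_DIRECTORY);
		cached_pid = cached_pid_fd >= 0 ? pid : -1;
	}
	return cached_pid_fd;
}

/* Open /proc/<pid>/<name> relative to the cached directory fd. */
static int open_proc_sketch(int pid, const char *name)
{
	int dfd;

	assert(name != NULL);
	dfd = proc_dir_fd(pid);
	return dfd < 0 ? -1 : openat(dfd, name, O_RDONLY);
}
```

With this layout, interleaving open_proc_sketch(pid, ...) and
open_proc_sketch(PROC_SELF_SKETCH, ...) reuses both directory fds.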
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When the service starts a page server, all the preparations (log, wdir, img dir,
etc.) happen in the parent task, and then we fork the page server.
This is OK for now, but once we serve several requests per connection, all
these resources will be leaked in the parent.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We can't dump netlink sockets, inotify or fanotify descriptors if they have
queued data, so let's add a function to check this.
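
One possible way to write such a check is a non-blocking poll() for
POLLIN. This is only a sketch under that assumption; the name
fd_has_queued_data is hypothetical and the patch itself may differ.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd has data queued for reading, 0 if not, -1 on error. */
static int fd_has_queued_data(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	ret = poll(&pfd, 1, 0); /* zero timeout: never block */
	if (ret < 0)
		return -1;
	return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A dump routine could refuse to proceed when this returns 1 for a
descriptor whose queue cannot be saved.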
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When doing a restore for LXC, we store some other metadata (which bridge a veth
was on) in the image directory so that the restore script can correctly unlock
a network device and attach it to the right interface. This patch is needed so
that the script can find this metadata.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Before we would not detect the mount point for co-mounted controllers. Things
still worked because we'd just re-mount them ourselves and traverse our own
mount point, but this saves an extra mount().
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is mostly for debugging purposes.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a set of routines that open /proc/$pid files via proc service
descriptor. Teach them to accept non-pids as pids to open /proc/self/*
and /proc/* files via the same engine.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a race. Consider we have 3 tasks, A, B and C. A and B
share fdtable, C -- does not. Then we might be in a situation
when A is restoring memory reading mem images, and B -- forking
the C child. In that case descriptors held by A (for mem restore)
will be inherited by C and will not get closed.
This reverts commit d36e07aabe073993d8ae9695e33f6e45b2eb6a21.
For all other tasks only unused service descriptors will be closed.
This change allows us to keep file descriptors which may be used for
restoring namespaces. All non-server descriptors must be closed before
restoring files.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Opening /dev/null may fail, check for ret code.
CID 1168167
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The --exec-cmd option specifies a command that will be execvp()-ed on successful
restore. This way the command specified here will become the parent process of
the restored process tree.
Waiting for the restored processes to finish is the responsibility of this
command. All service FDs are closed before we call execvp(). Standard output and error of
the command are redirected to the log file when we are restoring through the RPC
service.
This option will be used when restoring LinuX Containers and it seems helpful
for perf or other use cases when restored processes must be supervised by a
parent.
Two directions were researched in order to integrate CRIU and LXC:
1. We tell CRIU that after restoring the container it should execve()
lxc, properly explaining to it that there's a new container hanging
around.
2. We make LXC set itself as child subreaper, then fork() criu and ask
it to detach (-d) from the restored container afterwards. Being a subreaper,
it should get the container's init into its child list after that.
The main reason for choosing the first option is that the second one can't work
with the RPC service. If we call restore via the service then criu service will
be the top-most task in the hierarchy and will not be able to reparent the
restore trees to any other task in the system. Calling execve from service
worker sub-task (and daemonizing it) should solve this.
Signed-off-by: Deyan Doychev <deyandoichev@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This reverts commit 66ab5e1ad8ac682c9225446747885184b7bf41b5.
After Andrey's fixes that create mount points before dropping
old mounts and going to pivot_root, this patch is not needed.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We call tar, ip, iptables, etc. when restoring a container.
The problem is that these tools are called from inside the new
mount namespace after pivot_root(). But execvp uses the
PATH variable inherited from the host system, which may
not reflect real binaries layout.
Add "/bin" to path as temporary workaround.
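
A rough sketch of such a workaround, assuming "/bin" is simply appended
to whatever PATH was inherited. The helper name add_bin_to_path and the
append-at-the-end choice are assumptions, not necessarily what the patch
does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append "/bin" to PATH so execvp() can find basic tools. */
static int add_bin_to_path(void)
{
	const char *old = getenv("PATH");
	char buf[4096];
	int n;

	if (!old || !*old)
		return setenv("PATH", "/bin", 1);

	n = snprintf(buf, sizeof(buf), "%s:/bin", old);
	if (n < 0 || (size_t)n >= sizeof(buf))
		return -1; /* new value would not fit into the buffer */
	return setenv("PATH", buf, 1);
}
```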
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
otherwise it won't compile:
util.c: In function ‘cr_daemon’:
util.c:594:8: error: ignoring return value of ‘chdir’, declared
with attribute warn_unused_result [-Werror=unused-result]
chdir("/");
^
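
The usual way to satisfy -Werror=unused-result here is to check the
return value. A minimal sketch (the helper name move_to_root and the
error message are hypothetical):

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* chdir("/") with the result checked, so the build does not fail. */
static int move_to_root(void)
{
	if (chdir("/")) {
		perror("Can't change directory to /");
		return -1;
	}
	return 0;
}
```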
Signed-off-by: Tikhomirov Pavel <snorcht@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we will read all VmaEntries in one big MmEntry object,
so to avoid copying them all into vma_areas, make them be pointable.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The is_foo_link readlinks the lfd to check. This makes
anon-inodes dumping readlink several times to find proper
dump ops. Optimize this thing.
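
The optimization can be pictured as: readlink the descriptor once, then
compare the resulting string against every known kind. The sketch below
only illustrates that shape; the table contents and the names
classify_link and classify_fd are assumptions, not the real dump-ops
lookup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Example link targets; the list is illustrative, not exhaustive. */
static const struct {
	const char *target;
	const char *kind;
} anon_kinds[] = {
	{ "anon_inode:[eventfd]",   "eventfd"   },
	{ "anon_inode:[eventpoll]", "eventpoll" },
	{ "anon_inode:[signalfd]",  "signalfd"  },
	{ "anon_inode:inotify",     "inotify"   },
};

/* Match an already-read link target against the table. */
static const char *classify_link(const char *target)
{
	size_t i;

	assert(target != NULL);
	for (i = 0; i < sizeof(anon_kinds) / sizeof(anon_kinds[0]); i++)
		if (!strcmp(target, anon_kinds[i].target))
			return anon_kinds[i].kind;
	return NULL;
}

/* A single readlink() per descriptor, then pure string comparisons. */
static const char *classify_fd(int lfd)
{
	char path[64], link[256];
	ssize_t n;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", lfd);
	n = readlink(path, link, sizeof(link) - 1);
	if (n < 0)
		return NULL;
	link[n] = '\0';
	return classify_link(link);
}
```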
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The service shouldn't call client-provided scripts, as it
creates a security issue (the client may be unprivileged,
while the service is privileged).
In order to let the caller do what it would normally do with
criu-scripts, make criu notify it about scripts. The caller
then does whatever it needs and responds back.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
RPC will start the page-server daemon and needs to get
control back to report to the caller, but glibc's
daemon() does exit() in the parent context, preventing that.
Thus, introduce our own daemonizing routine.
Strictly speaking, this is not a pure daemon() clone, as the
parent process has to exit itself. But this is OK for now.
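
A sketch of what such a routine could look like, assuming the usual
fork/setsid//dev/null sequence. The name daemonize_sketch and the exact
steps are illustrative; the key difference from daemon() is that the
parent returns instead of calling exit().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the child's pid (> 0) in the parent, 0 in the detached child
 * and -1 if fork() fails. The parent is left running so it can report
 * back to its caller and exit on its own later.
 */
static int daemonize_sketch(void)
{
	pid_t pid = fork();
	int fd;

	if (pid < 0)
		return -1;
	if (pid > 0)
		return (int)pid; /* parent: control goes back to the caller */

	/* Child: detach from the session and the standard descriptors. */
	if (setsid() < 0)
		_exit(1);
	if (chdir("/"))
		_exit(1);
	fd = open("/dev/null", O_RDWR);
	if (fd < 0)
		_exit(1);
	dup2(fd, STDIN_FILENO);
	dup2(fd, STDOUT_FILENO);
	dup2(fd, STDERR_FILENO);
	if (fd > STDERR_FILENO)
		close(fd);
	return 0;
}
```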
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When criu check is run as non-root, a lot of information is printed
to the user, but the one missing bit is that it should be run as root.
Fix it.
I still don't like the fact that some other stuff is printed here,
like the timestamp and the __FILE__:__LINE__, but this should be
fixed separately.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is useful if one needs to do some final action before checkpoint
is complete. For example, in case of online migration one may provide
a script which checks that the restore procedure on the remote node
ended without errors; the script then returns zero and criu
simply kills the running instance of the application.
In turn, if migration failed, the script can return nonzero code
and criu won't kill the application but continue its execution
instead.
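
The decision described above boils down to mapping the script's exit
code to one of two outcomes. A hedged sketch, assuming the script is run
via system(); how criu really invokes scripts is not shown here, and the
names post_dump_outcome, KILL_TASKS and RESUME_TASKS are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

enum dump_outcome { KILL_TASKS, RESUME_TASKS };

/* Zero exit code: kill the dumped tasks; anything else: let them run. */
static enum dump_outcome post_dump_outcome(const char *script_cmd)
{
	int status;

	assert(script_cmd != NULL);
	status = system(script_cmd);
	if (status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return KILL_TASKS;
	return RESUME_TASKS;
}
```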
https://bugzilla.openvz.org/show_bug.cgi?id=2583
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When a kernel didn't show vma flags, we set MAP_GROWSDOWN for stack
vmas, but it's not reliable. E.g. thread stacks are mapped without
MAP_GROWSDOWN.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is less useful than fixing typos in output messages, but anyway.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>