Reasoning: some systems have /sys/fs/cgroup stuff mounted as read-only
and we have to either remount it rw or create our own set. The former
doesn't look sane as this rw remounting is also done by ststemd, so
let's return back to manual cgyard construction.
This reverts commit 860df95f859cf7ba23b57fc832793c623a5897e4.
Conflicts:
cgroup.c
include/cr_options.h
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When been playing wich checkpoint/restore of container I found
that we can't reuse existing controller if they were pre-created.
For example currently in PCS7 we're bindmount cgroups which belong
to a container in a form of
/sys/fs/cgroup/<controller>/<container> ==> /sys/fs/cgroup/<controller>
so that CRIU dumps such configuration fine but on restore
it recreates controllers from the scratch which we would
like to bindmount them and ask CRIU to restore subcgroups
and their parameters.
So I extended --manage-cgroups option to take <mode> arguments.
Detailed description in docs.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we always create temporary directory where we restore
cgroups, but this won't work in case if mounting cgroups is forbidden
from inside of a container for some reason (as in OpenVZ kernel).
So one can pass --cgroup-yard option to specify an existing
directory where cgroups are living. By default we assume it
lays in /sys/fs/cgroup.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This option enables external (slave) bind mounts to be resolved.
v2: don't always assume that when the master id matches, the mounts match
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
With this flag, external shared bind mounts are attempted to be resolved
automatically.
v2: don't always assume when the sharing matches that the mount matches
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When this option is specified, if an external (private) bind mount is not
specified by --ext-mount-map KEY:VAL then it is attempted to be resolved
automatically.
v2: introduce find_best_external_match, which looks for the best match based on
sharing/slave ids; don't try to resolve fsroot_mounted() mountpoints
v3: get rid of really_collect_self_mounts
v4: get rid of fsroot_mounted() check when autodetecting external mounts
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They both are using 'd' option in different context
though, lets give them two names.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this mode we test if target cpu has all features present
in image file but do not require bit to bit match: target cpu
may be a new one with more features present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are cases where a process's file descriptor cannot be restored
from the checkpoint images. For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.
There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time. For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.
In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option. The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.
As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d. This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.
It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited. So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.
For examples please visit http://criu.org/Category:HOWTO.
v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
Also, as an extra measure of caution, added checks in the inherit fd
look up functions to make sure that the inherit fd hasn't been reused.
The patch also includes minor cosmetic changes.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They will serve to choose capability level when migrating
images between various hardware nodes.
Note it's bare functionality introduced in this commit,
the real implementation is in next patches.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a slight mess with how criu restores root task.
Right now we have the following options.
1) CLI
a) Usually
task calling criu
`- criu
`- root restored task
b) when --restore-detached AND root has pdeath_sig
task calling criu
`- criu
`- root restored task
2) Library/SWRK
task using lib/swrk
`- criu
`- root restored task
3) Standalone service
a) Usually
service
`- service sub task
`- root restored task
b) when root has pdeath_sig
criu service
`- criu sub task
`- root restored task
It would be better is CRIU always restored the root task as sibling,
but we have 3 constraints:
First, the case 1.a is kept for zdtm to run tests in pid namespaces
on 3.11, which in turn doesn't allow CLONE_PARENT | CLONE_NEWPID.
Second, CLI w/o --restore-detach waits for the restored task to die and
this behavior can be "expected" already.
Third, in case of standalone service tasks shouldn't become service's
children.
And I have one "plan". The p.haul project while live migrating tasks
on destination node starts a service, which uses library/swrk mode. In
this case the restored processes become p.haul service's kids which is
also not great.
That said, here's the option called --restore-child that pairs the
--restore-detach like this:
* detached AND child:
task
`- criu restore (exits at the end)
`- root task
The root task will become task's child.
This will be default to library/swrk.
This is what LXC needs.
* detach AND !child
task
`- criu restore (exits at the end)
`- root task
The root task will get re-parented to init.
This will be compatible with 1.3.
This will be default to standalone service and
to my wish with the p.haul case.
* !detach AND child
task
`- criu restore (waits for root task to die)
`- root task
This should be deprecated, so that criu restore doesn't mess
with task <-> root task signalling.
* !detach AND !child
task
`- criu restore (waits for root task to die)
`- root task
This is how plain criu restore works now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
The swrk action is turning out to be a cool thing. We can
spawn criu with swrk action with some FD being open, then
ask for dump/pre-dump/page-server telling it that some
descriptor it needs is "out there".
This patch lets us specify that the page server communication
channel is already in criu's fdtable.
TODO: teach regular service to accept fd via service socket.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root. This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likekely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The motivation for this is to be able to restore containers into cgroups other
than what they were dumped in (if, e.g. they might conflict with an existing
container). Suppose you have a container in:
memory:/mycontainer
cpuacct,cpu:/mycontainer
blkio:/mycontainer
name=systemd:/mycontainer
You could then restore them to /mycontainer2 via --cgroup-root /mycontainer2.
If you want to restore different controllers to different paths, you can
provide multiple arguments, for example, passing:
--cgroup-root /mycontainer2 --cgroup-root cpuacct,cpu:/specialcpu \
--cgroup-root name=systemd:/specialsystemd
Would result in things being restored to:
memory:/mycontainer2
cpuacct,cpu:/specialcpu
blkio:/mycontainer2
name=systemd:/specialsystemd
i.e. a --cgroup-root without a controller prefix specifies the new default root
for all cgroups.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
criu managed cgroups is now an opt-in thing, so by default criu does not manage
(i.e. dump or restore) cgroups. This allows users to use the previous behavior.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Andrey validly pointed out, that restoring pdeath_sig is not
compatible with criu_restore_child() call -- after criu restore
children, it will exit and fire the pdeath_sig into restored
tree root, potentially killing it.
The fix for that could be -- when started in swrk more, criu can
restore tree not as children tasks, but as siblings, using the
CLONE_PARENT flag when fork()-ing the root task.
With this we should also take care about errors handing -- right
now criu catches the SIGCHILD from dying children tasks, and
since we plan to create them be children of the criu parent (the
library caller) we will not be able to catch them. To do so we
SEIZE the root task in advance thus causing all SIGCHLD-s go to
criu, not to its parent.
Having this done we no longer need the SUBREAPER trick in the
library call -- tasks get restored right as callers kids :)
Some thoughts for future -- using this trick we can finally make
"natural" restoration of shell jobs. I.e. -- make criu restore
some subtree right under bash, w/o leaving itself as intermediate
task and w/o re-parenting the subtree to init after restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>
On dump one uses one or more --ext-mount-map option with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.
On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace the should be bind mounted into the mountpoint.
v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The --exec-cmd option specifies a command that will be execvp()-ed on successful
restore. This way the command specified here will become the parent process of
the restored process tree.
Waiting for the restored processes to finish is responsibility of this command.
All service FDs are closed before we call execvp(). Standad output and error of
the command are redirected to the log file when we are restoring through the RPC
service.
This option will be used when restoring LinuX Containers and it seems helpful
for perf or other use cases when restored processes must be supervised by a
parent.
Two directions were researched in order to integrate CRIU and LXC:
1. We tell to CRIU, that after restoring container is should execve()
lxc properly explaining to it that there's a new container hanging
around.
2. We make LXC set himself as child subreaper, then fork() criu and ask
it to detach (-d) from restore container afterwards. Being a subreaper,
it should get the container's init into his child list after it.
The main reason for choosing the first option is that the second one can't work
with the RPC service. If we call restore via the service then criu service will
be the top-most task in the hierarchy and will not be able to reparent the
restore trees to any other task in the system. Calling execve from service
worker sub-task (and daemonizing it) should solve this.
Signed-off-by: Deyan Doychev <deyandoichev@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When migrating container with copying its FS, the inode numbers
and thus their handles wil change. This will make the restore of
inotify/fanotify fail, since they do it via fhandles.
We've already faced the problems with fsnotifies on NFS -- they
don't work there. To address this an irmap cache is created on
pre-dump, so to resolve the issue with changed inodes during
migration, we can force the irmap cache build.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This option will serve to manage CPU capabilities
to be matched/ignored on restore procedure. At the
moment we introduce 'fpu','all' capability arguments.
By default 'all' is set.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Service shouldn't call client provided scripts, as it
creates a security issue (client may be unpriviledged,
while the service is).
In order to let caller do what it would normally do with
criu-scripts, make criu notify it about scripts. Caller
then do whatever it needs and responds back.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The -F|--fields option specifies which fields (by name, comma
separated) should be printed.
For nested fields all names in path should be specified.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Libraries (plugins) is going to be used for dumping and restoring
external dependencies (e.g. dbus, systemd journal sockets, charecter
devices, etc)
A plugin can have the cr_plugin_init() and cr_plugin_fini functions for
initialization and deinialization.
criu-plugin.h contains all things, which can be used in plugins.
v2: rename lib to plugin
v3: add a default value for a plugin path.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For big #ifdef/#endif chunks we do a comment /* */
at #endif. Add missing ones.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>