Finally add --enable-fs option to specify the comma separated list of
filesystem names which should be treated as FSTYPE_AUTO.
Note: obviously this option is not safe, use at your own risk. "dump"
will always succeed if the mntpoint is auto, but "restore" can fail or
do something wrong if mount(src, mountpoint, flags, options) can not
actually "just work" as FSTYPE_AUTO logic expects.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They both are using 'd' option in different context
though, lets give them two names.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Which obviously can be used to "ignore" the mounts we do not want or
need to dump. The user should know what he does.
Note: this patch changes parse_mountinfo() to check should_skip_mount().
This is because imo we want to filter out the unwanted mounts asap, af
if they do not exist. This increases the chances the dumping will fail
if something else depends on this mount. Say, another mountpoint or an
opened file.
Perhaps it makes sense to teach should_skip_mount() to use fnmatch()
and/or look at the optional "(fs|mnt)=" prefix to skip by fsname too.
To me it would be better to force the user of this option to understand
what it does. Say, if "dump" fails because the child mount can't find
the skipped parent, he should add another --skip-mnt option or do not
dump. Otherwise, if we do this automagically the user can probably be
surpised, he might even miss the fact that we skip more than he asked.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is required to use criu swrk in libcontainer.
v2: remove useless function declaration
allow to set inherit_fd only for swrk
v3: check swrk out of loop
Cc: Saied Kazemi <saied@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Manage to forgot to increase pointer by length
of the words parsed.
Reported-by: Mark O'Neill <mao@tumblingdice.co.uk>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Right now we state that CRIU works on 3.11 and above kernels and, at the
same time, have support for a couple of new features like aio, tun, timerfd
etc. available in later kernels. Since these new features do not break
generic operations we do not require them in the kernel strictly.
However, in the zdtm tests it's very important to know exactly what can
and what cannot be tested. Right now this is done in a tough manner -- if
the kernel is not 3.11 or criu check fails for _any_ reason we treat the
kernel as being "bad" and throw out a set of tests.
I propose to test some individual features and form the list of tests
in a more fine-grained manner.
This patch only fixes the AIO, mnt_id, tun and posix-timers tests. Next
I will add checks and fixes for user-namespaces tests.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
When restoring a pair of veth devices that had one end inside a namespace
or container and the other end outside, CRIU creates a new veth pair,
puts one end in the namespace/container, and names the other end from
what's specified in the --veth-pair IN=OUT command line option.
This patch allows for appending a bridge name to the OUT string in the
form of OUT@<BRIDGE-NAME> in order for CRIU to move the outside veth to
the named bridge. For example, --veth-pair eth0=veth1@br0 tells CRIU
to name the peer of eth0 veth1 and move it to bridge br0.
This is a simple and handy extension of the --veth-pair option that
obviates the need for an action script although one can still do the same
(and possibly more) if they prefer to use action scripts.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In this mode we test if target cpu has all features present
in image file but do not require bit to bit match: target cpu
may be a new one with more features present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are cases where a process's file descriptor cannot be restored
from the checkpoint images. For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.
There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time. For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.
In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option. The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.
As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d. This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.
It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited. So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.
For examples please visit http://criu.org/Category:HOWTO.
v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
Also, as an extra measure of caution, added checks in the inherit fd
look up functions to make sure that the inherit fd hasn't been reused.
The patch also includes minor cosmetic changes.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
See the comment below for an explanation of what is going on. We will
ultimately need to handle dumping the netlink data, but I think it is good to
prevent injecting events into the stream during a dump. So we pre-load the
modules, even though it isn't very pretty.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On Wed, Oct 01, 2014 at 05:51:09PM +0400, Pavel Emelyanov wrote:
> > Yes, what you've been expecting?
>
> if (!strcmp(argv[optind]))
> return cpu_cap_check()
>
> or smth like this.
updated. So if it become confusing -- feel free to merge [1;9] and
ping me to resend the rest, or pick up from attachements.
>From 6af96ff63ac82f9566c3cba9c116dc67698c9797 Mon Sep 17 00:00:00 2001
From: Cyrill Gorcunov <gorcunov@openvz.org>
Date: Tue, 30 Sep 2014 18:33:40 +0400
Subject: [PATCH] cpuinfo: Add "cpuinfo [dump|check]" commands
They allow to validate cpuinfo information
without running complete dump/restore actions.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On Wed, Oct 01, 2014 at 04:57:40PM +0400, Pavel Emelyanov wrote:
> On 10/01/2014 01:07 AM, Cyrill Gorcunov wrote:
> > On Tue, Sep 30, 2014 at 09:18:53PM +0400, Cyrill Gorcunov wrote:
> >> If a user requested criu to dump cpuinfo image then we
> >> write one on dump and verify on restore. At the moment
> >> we require all cpu feature bits to match the destination
> >> cpu in a sake of simplicity, but in future we need deps
> >> engine which would filer out bits and test if cpu we're
> >> restoring on is more capable than one we were dumping at
> >> allowing to proceed restore procedure.
> >>
> >> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> >
> > Updated to new img format
Something like attached?
>From 59272a9514311e6736cddee08d5f88aa95d49189 Mon Sep 17 00:00:00 2001
From: Cyrill Gorcunov <gorcunov@openvz.org>
Date: Thu, 25 Sep 2014 16:04:10 +0400
Subject: [PATCH] cpuinfo: x86 -- Add dump and validation of cpuinfo image
If a user requested criu to dump cpuinfo image then we
write one on dump and verify on restore. At the moment
we require all cpu feature bits to match the destination
cpu in a sake of simplicity, but in future we need deps
engine which would filer out bits and test if cpu we're
restoring on is more capable than one we were dumping at
allowing to proceed restore procedure.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They will serve to choose capability level when migrating
images between various hardware nodes.
Note it's bare functionality introduced in this commit,
the real implementation is in next patches.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This slightly changes the visible API, but i think it's safe now.
The idea behind it to make single --cpu-cap to indicate "--cpu-cap all".
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a slight mess with how criu restores root task.
Right now we have the following options.
1) CLI
a) Usually
task calling criu
`- criu
`- root restored task
b) when --restore-detached AND root has pdeath_sig
task calling criu
`- criu
`- root restored task
2) Library/SWRK
task using lib/swrk
`- criu
`- root restored task
3) Standalone service
a) Usually
service
`- service sub task
`- root restored task
b) when root has pdeath_sig
criu service
`- criu sub task
`- root restored task
It would be better is CRIU always restored the root task as sibling,
but we have 3 constraints:
First, the case 1.a is kept for zdtm to run tests in pid namespaces
on 3.11, which in turn doesn't allow CLONE_PARENT | CLONE_NEWPID.
Second, CLI w/o --restore-detach waits for the restored task to die and
this behavior can be "expected" already.
Third, in case of standalone service tasks shouldn't become service's
children.
And I have one "plan". The p.haul project while live migrating tasks
on destination node starts a service, which uses library/swrk mode. In
this case the restored processes become p.haul service's kids which is
also not great.
That said, here's the option called --restore-child that pairs the
--restore-detach like this:
* detached AND child:
task
`- criu restore (exits at the end)
`- root task
The root task will become task's child.
This will be default to library/swrk.
This is what LXC needs.
* detach AND !child
task
`- criu restore (exits at the end)
`- root task
The root task will get re-parented to init.
This will be compatible with 1.3.
This will be default to standalone service and
to my wish with the p.haul case.
* !detach AND child
task
`- criu restore (waits for root task to die)
`- root task
This should be deprecated, so that criu restore doesn't mess
with task <-> root task signalling.
* !detach AND !child
task
`- criu restore (waits for root task to die)
`- root task
This is how plain criu restore works now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
When service starts page server all the preparations (log, wdir, img dir, etc.)
happen in parent task, then we fork page server.
This is OK for now, but when we will serve several requests per connection, all
these resources would be leaked in parent.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The swrk action is turning out to be a cool thing. We can
spawn criu with swrk action with some FD being open, then
ask for dump/pre-dump/page-server telling it that some
descriptor it needs is "out there".
This patch lets us specify that the page server communication
channel is already in criu's fdtable.
TODO: teach regular service to accept fd via service socket.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root. This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The return values were getting dangerously close to the range of meaningful
values, in particular the next candidate 63 is equal to '?' which is the
typical return value in case of error.
The return values for long options may be any integer, so bump them up to
outside the ascii range, start above 1000. For ease of review this patch, keep
the existing range (41-62) and increment each value by 1000.
Tested:
- Ran "criu --help", works fine.
- Manual dump and restore with some of the options, worked fine.
- Ran the zdtm test suite, tests passed.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is to make it convenient for service to setup the same thing.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likekely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The motivation for this is to be able to restore containers into cgroups other
than what they were dumped in (if, e.g. they might conflict with an existing
container). Suppose you have a container in:
memory:/mycontainer
cpuacct,cpu:/mycontainer
blkio:/mycontainer
name=systemd:/mycontainer
You could then restore them to /mycontainer2 via --cgroup-root /mycontainer2.
If you want to restore different controllers to different paths, you can
provide multiple arguments, for example, passing:
--cgroup-root /mycontainer2 --cgroup-root cpuacct,cpu:/specialcpu \
--cgroup-root name=systemd:/specialsystemd
Would result in things being restored to:
memory:/mycontainer2
cpuacct,cpu:/specialcpu
blkio:/mycontainer2
name=systemd:/specialsystemd
i.e. a --cgroup-root without a controller prefix specifies the new default root
for all cgroups.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
criu managed cgroups is now an opt-in thing, so by default criu does not manage
(i.e. dump or restore) cgroups. This allows users to use the previous behavior.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently, we only check if process gids match primary gid of user.
But process and user have additional groups too. So lets:
1) check that process rgid,egid and sgid are in the user's grouplist.
2) on restore check that user has all groups from the images.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Andrey validly pointed out, that restoring pdeath_sig is not
compatible with criu_restore_child() call -- after criu restore
children, it will exit and fire the pdeath_sig into restored
tree root, potentially killing it.
The fix for that could be -- when started in swrk more, criu can
restore tree not as children tasks, but as siblings, using the
CLONE_PARENT flag when fork()-ing the root task.
With this we should also take care about errors handing -- right
now criu catches the SIGCHILD from dying children tasks, and
since we plan to create them be children of the criu parent (the
library caller) we will not be able to catch them. To do so we
SEIZE the root task in advance thus causing all SIGCHLD-s go to
criu, not to its parent.
Having this done we no longer need the SUBREAPER trick in the
library call -- tasks get restored right as callers kids :)
Some thoughts for future -- using this trick we can finally make
"natural" restoration of shell jobs. I.e. -- make criu restore
some subtree right under bash, w/o leaving itself as intermediate
task and w/o re-parenting the subtree to init after restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>
To help restoring tasks from images as kids to the caller, we can
do the trick.
1. Caller sets himself as child reaper with PR_SET_CHILD_SUBREAPER prctl
2. Caller makes sure criu binary is suid-ed and owned by root
3. Caller forks and calls execv() on criu asking it to restore
4. Criu finishes restore and exits. All its kids get reparented to the
criu's parent, i.e. -- to the library caller.
5. Caller stops being subreaper
In order to make the execv() and arguments passing simpler I propose
to execv() the service worker function, that accepts options via socket.
This is good for two reasons.
1. We don't have to construct CLI options in libcriu
2. We reuse other service's facilities, such as security checks,
ability to dump, pre-dump and other stuff
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On dump one uses one or more --ext-mount-map option with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.
On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace the should be bind mounted into the mountpoint.
v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The --exec-cmd option specifies a command that will be execvp()-ed on successful
restore. This way the command specified here will become the parent process of
the restored process tree.
Waiting for the restored processes to finish is responsibility of this command.
All service FDs are closed before we call execvp(). Standad output and error of
the command are redirected to the log file when we are restoring through the RPC
service.
This option will be used when restoring LinuX Containers and it seems helpful
for perf or other use cases when restored processes must be supervised by a
parent.
Two directions were researched in order to integrate CRIU and LXC:
1. We tell to CRIU, that after restoring container is should execve()
lxc properly explaining to it that there's a new container hanging
around.
2. We make LXC set himself as child subreaper, then fork() criu and ask
it to detach (-d) from restore container afterwards. Being a subreaper,
it should get the container's init into his child list after it.
The main reason for choosing the first option is that the second one can't work
with the RPC service. If we call restore via the service then criu service will
be the top-most task in the hierarchy and will not be able to reparent the
restore trees to any other task in the system. Calling execve from service
worker sub-task (and daemonizing it) should solve this.
Signed-off-by: Deyan Doychev <deyandoichev@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When migrating container with copying its FS, the inode numbers
and thus their handles wil change. This will make the restore of
inotify/fanotify fail, since they do it via fhandles.
We've already faced the problems with fsnotifies on NFS -- they
don't work there. To address this an irmap cache is created on
pre-dump, so to resolve the issue with changed inodes during
migration, we can force the irmap cache build.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This option will serve to manage CPU capabilities
to be matched/ignored on restore procedure. At the
moment we introduce 'fpu','all' capability arguments.
By default 'all' is set.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we specify log level to none (0) the result is LOG_INFO (2).
Acked-by: Andrew Vagin <avagin@parallels.com>
Acked-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Service will call the pre-dump routine, so this is factoring out
enforcin options for CLI and RPC.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While timestamps can be handy, they clutter output for normal users.
Let's print them only when verbosity (-v) is increased from default.
Currently, default is 2 (-vv) so for timestamps one should use -vvv
or -v3.
Alternatively, we could introduce a separate --timestamps option.
Personally, I find it more handy for timestamps to be tied to log level.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Current description of -v[vvv] was taken from criu-log.h comments
and describes specific log levels used by pr_* functions. The problem
is -vX includes all previous X-1, X-2... levels. Say, -vvvv description
says "debug only", while in fact it is not "only". Fix accordingly.
Also, removed -v0 description as it is useless. What -v0 in fact does is
it sets the log level to default -- same as if -vXXX is not used.
In addition, change a delimiter between option alternatives from comma
to a vertical bar in criu --help output, to be in line with the rest
of usage output.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
* Don't show long usage in case of usage error, otherwise an actual
error message will be lost in long output.
* Print error if command is not specified.
* Return 0 if criu --help is used.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
* Make "invalid usage" type messages uniform
* Use pr_msg() not pr_err(), as we don't want to clutter output
with useless information like (crtools.c:123).
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
* Introduce a generic way to report that option argument is invalid
* Switch to using it from existing places
(options --veth-pair, --port, -n)
* Check for invalid argument of -p and -t and report it.
Notes:
1) In order to correctly print long option name in case it was used
instead of a short one, I had to move "struct option long_opts"
to main() context, this is why the patch is so long.
2) pr_msg() (rather than pr_err()) is used to print errors, otherwise
it is prefixed with that (crtools.c:123) prefix which makes it
look weird.
3) Usage is not shown in case of error, otherwise an error message
is lost in output.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case open_image_dir() fails, it prints an error why,
so there is no need to print it one more time.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The -F|--fields option specifies which fields (by name, comma
separated) should be printed.
For nested fields all names in path should be specified.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Libraries (plugins) is going to be used for dumping and restoring
external dependencies (e.g. dbus, systemd journal sockets, charecter
devices, etc)
A plugin can have the cr_plugin_init() and cr_plugin_fini functions for
initialization and deinialization.
criu-plugin.h contains all things, which can be used in plugins.
v2: rename lib to plugin
v3: add a default value for a plugin path.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>