2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-27 12:28:14 +00:00

1696 Commits

Author SHA1 Message Date
Ruslan Kuprieiev
5e58a5dc9f crtools: check for setproctitle_init
Check for setproctitle_init, as old versions of libbsd don't have one.

Reported-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Acked-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:14:39 +04:00
Ruslan Kuprieiev
2144583732 include: add setproctitle.h
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Acked-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:14:37 +04:00
Andrey Vagin
5ed2004733 dump: clean up shared_fdtable
It's cleaned up accoding with following statements:
* files_id can't be zero (look at dump_task_kobj_ids)
* item->ids is allocated for all non-dead tasks
* a parent can't be dead

In addition here is a tiny coding stype fix.

Fixes: 475bb1e77522 ("rst: Evaluate per-task clone mask early")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:10:14 +04:00
Andrey Vagin
33c75d0df9 eventpoll: parse_fdinfo_pid_s() returns allocated object for eventpol tfd
We are going to collect all objects in a list and write them into
the eventpoll image. The eventpoll tfd image will be depricated.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:08:17 +04:00
Andrey Vagin
78a54bd87c fsnotify: parse_fdinfo_pid_s() returns allocated object for fanotify marks
We are going to collect all objects in a list and write them into
the fanotify image. The fanotify mark image will be depricated.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:07:44 +04:00
Andrey Vagin
7079bb1086 fsnotify: parse_fdinfo_pid_s() returns allocated object for inotify wd (v2)
We are going to collect all objects in a list and write them into
the inotify image. The inotify wd image will be depricated.

v2: cb() must always free an entry
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 16:07:43 +04:00
Saied Kazemi
9eec8b03af Use --root instead of --aufs-root
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root.  This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-27 14:31:40 +04:00
Saied Kazemi
d8b41b6525 Added AUFS support.
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.

The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver.  For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.

[ xemul: With AUFS files sometimes, in particular -- in case of a
  mapping of an executable file (likekely the one created at elf load),
  in the /proc/pid/map_files/xxx link target we see not the path
  by which the file is seen in AUFS, but the path by which AUFS
  accesses this file from one of its "branches". In order to fix
  the path we get the info about branches from sysfs and when we
  meet such a file, we cut the branch part of the path. ]

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-21 18:35:22 +04:00
Ruslan Kuprieiev
9b2d1774ba image: mark CR_FD_SIGNAL and CR_FD_PSIGNAL as obsoleted and don't create signal-s*.img, v2
After this patch, signal-s*.img won't be created.

v2: just move them to the end of array

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 13:09:49 +04:00
Pavel Emelyanov
f781ba0466 rst: Rework task_entries to use rst_mem engine
The task_entries is a small structure used to coordinate the
processes restore stages. Currentl we allocate one page for
it and handle one separately. No need in this complexity, actually.
The rst_mem engine is already capable to controll this small object.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 13:00:10 +04:00
Pavel Emelyanov
5f9acc8dc9 shmem: Explicitly initialize rst_shmems
This is a position in the RM_SHREMAP memory. Since shmems are currently
the only user of it, this is validly equals zero, but it will change soon.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 13:00:07 +04:00
Tycho Andersen
94f6c87c9f cg: add --cgroup-root option
The motivation for this is to be able to restore containers into cgroups other
than what they were dumped in (if, e.g. they might conflict with an existing
container). Suppose you have a container in:

memory:/mycontainer
cpuacct,cpu:/mycontainer
blkio:/mycontainer
name=systemd:/mycontainer

You could then restore them to /mycontainer2 via --cgroup-root /mycontainer2.
If you want to restore different controllers to different paths, you can
provide multiple arguments, for example, passing:

--cgroup-root /mycontainer2 --cgroup-root cpuacct,cpu:/specialcpu \
--cgroup-root name=systemd:/specialsystemd

Would result in things being restored to:

memory:/mycontainer2
cpuacct,cpu:/specialcpu
blkio:/mycontainer2
name=systemd:/specialsystemd

i.e. a --cgroup-root without a controller prefix specifies the new default root
for all cgroups.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 12:58:36 +04:00
Sophie Blee-Goldman
e606c2141e Dump capabilities from the parasite
Needed for future user namespace support. Capabilities will have to be
dumped from the parasite, ie from inside the namespace since there is no
obvious way to 'translate' capabilities from the global namespace (unlike
with uids and gids, where the id mappings can be used for translation).

[ additional explanation from Andrew Vagin:

"capabilities" are not translated between namespaces. They can exist
only in one userns, where a process lives. If a process is created in a
new userns, it gets a full set of capabilities in this userns, and
loses all caps in a parent userns.

So if capabilities are not shown in /proc/pid/stat, we have no way to
get it except of using parasite code. ]

Signed-off-by: Sophie Blee-Goldman <ableegoldman@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-15 23:10:44 +04:00
Sophie Blee-Goldman
3faaed2f64 Bug-fix in size calculation
Fixes a bug in how PARASITE_MAX_GROUPS was calculated, and adds a
compiler check to assert that parasite_dump_creds doesn't exceed
the page size.

Signed-off-by: Sophie Blee-Goldman <ableegoldman@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-13 13:04:58 +04:00
Pavel Emelyanov
548625132d pstree: Introduce task_alive() helper
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-12 14:41:00 +04:00
Pavel Emelyanov
7960379f71 flock: Merge all file lock entries into single image file
They are now in per-pid images, but every entry contains a
pid to which it "belongs". This belonging is fake -- it's
just a pid of a task who placed the lock, while locks really
belong to files. We even have a bug when task that locked
a file exited and "delegated" the lock to its child.

This images merge reduces the amount of image files criu
generates and may simplify the fix of mentioned above issue.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-12 14:38:49 +04:00
Pavel Emelyanov
4816882da9 img: Add ability to check whether optional image collection happened
A bit later we'd need to check whether cinfo collector
opened an image or not due to file absense.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-12 14:38:22 +04:00
Tycho Andersen
f95b05eb75 opts: add --manage-cgroups option
criu managed cgroups is now an opt-in thing, so by default criu does not manage
(i.e. dump or restore) cgroups. This allows users to use the previous behavior.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-12 14:32:50 +04:00
Garrison Bellack
4c7bc7678e Cgroup property restoration infrastructure
Restores 2 cgroup properties after the criu restoration of tasks.
Currently the cgroup files to be restored are static but
are easily extendable. To change the properties to be restored,
edit this list at the top of cgroup.c. If a cgroup exists during
restoration, its properties will not be overwritten.
Work based off Tycho Anderson tycho.andersen@canonical.com

Change-Id: Ida32b9773eeac1d4d6e82ad644524ed099d5f9b1
Signed-off-by: Garrison Bellack <gbellack@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-08 17:06:08 +04:00
Cyrill Gorcunov
7158448dd6 timerfd: Implement check routine
Reported-by: Jenkins
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-07 10:18:09 +04:00
Cyrill Gorcunov
ecd432fe27 timerfd: Implement c/r procedure
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 19:20:09 +04:00
Cyrill Gorcunov
5c93ba3b7b timerfd: Add protobuf entries into the image
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 19:18:34 +04:00
Andrey Vagin
339f456af3 link-remap: open link-remap files from correct mountpoints (v3)
Here is a problem with ghost files. Links are created on restore, but
they can't be created on any mount point, because a mount point can be
non-root bind-mount of another one. So we need to find the root mount
and create all links there.

v2: clean up
v3: add optimization for the case when both links on the same mount
point.
v4: don't look up mount points by mnt_id in a second time.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 19:14:16 +04:00
Andrey Vagin
ce5aa74d10 mount: save local mount point paths on restore
On restore we add a temporary root to a mount point path. It's convinient
for restoring mount namespaces, but real paths are used for restoring
link-remap files.

v2: replace the offset field on a char * field

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 19:14:15 +04:00
Pavel Emelyanov
5289ea973a mnt: Extend comment about how mntinfo->mountpoint path looks like
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 12:04:22 +04:00
Pavel Emelyanov
9fd793e565 stat: Pass namespace into phys_stat_resolve_dev, not mnt tree
This makes the API simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 10:57:27 +04:00
Pavel Emelyanov
090587e1a1 stat: Pass namespace into phys_stat_dev_match, not mnt tree
This makes the API simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 10:57:25 +04:00
Ruslan Kuprieiev
2b268c6c21 security: check additional groups,v5
Currently, we only check if process gids match primary gid of user.
But process and user have additional groups too. So lets:
     1) check that process rgid,egid and sgid are in the user's grouplist.
     2) on restore check that user has all groups from the images.

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-06 10:20:27 +04:00
Andrey Vagin
967dba606a mount: add helper mntns_get_root_by_mnt_id
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-05 16:38:19 +04:00
Cyrill Gorcunov
18fe357563 vdso: Implement vDSO proxification of any vvar/vdso order
In latest linux-next the vdso zone is placed _after_ vvar
zone so eventually we need to handle any combination of
the following cases

 - no vvar zone
 - vvar before vdso
 - vvar after vdso

Here we address all them.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-04 15:35:03 +04:00
Cyrill Gorcunov
6446fd2c1d vdso: Move parking into a separate routine
Since we might have a several vDSO zones lets hide
handling in arch-specific routines.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-04 15:34:34 +04:00
Pavel Emelyanov
9b6c41f2a0 cg: Remove unused cgroup_dir field
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
2014-07-15 17:29:23 +04:00
Tycho Andersen
0f178a1f99 cg: correctly detect co-mounted controller mount point
Before we would not detect the mount point for co-mounted controllers. Things
still worked because we'd just re-mount them ourselves and traverse our own
mount point, but this saves an extra mount().

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-14 15:14:37 +04:00
Tycho Andersen
51876eea5d Attempt to restore cgroups
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.

On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-10 17:00:28 +04:00
Pavel Emelyanov
a919dbc9c6 files: Fix restoration of ghost cwd (and root)
When cwd is removed (it can be) we need to collect the respective
file_desc before starting opening any files to properly handle
ghost refcounts. Otherwise we will miss one refcount from the
cwd's on ghost, which in turn will either BUG inside ghost removal,
or will fail the cwd due to the respective dir being removed too
early.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:09:06 +04:00
Pavel Emelyanov
ba8671b4c1 files: Split open_reg_by_id into two parts
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:09:04 +04:00
Pavel Emelyanov
9b91bf390d files: Split fs restore into prepare and restore
The prepare one will become more complicated soon.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:09:03 +04:00
Pavel Emelyanov
b8d01d1b7a files: Rename prepare_fs into restore_fs
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:09:02 +04:00
Pavel Emelyanov
d0097b2db0 files: Support ghost directories restore
If we have opened and rmdir-ed directory, the dump works OK
creating the ghost file and remap, but restore creates _file_
instead of directory.

Fix this.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:08:59 +04:00
Pavel Emelyanov
84eb0a1927 criu: Restore tasks as siblings in swrk
Andrey validly pointed out, that restoring pdeath_sig is not
compatible with criu_restore_child() call -- after criu restore
children, it will exit and fire the pdeath_sig into restored
tree root, potentially killing it.

The fix for that could be -- when started in swrk more, criu can
restore tree not as children tasks, but as siblings, using the
CLONE_PARENT flag when fork()-ing the root task.

With this we should also take care about errors handing -- right
now criu catches the SIGCHILD from dying children tasks, and
since we plan to create them be children of the criu parent (the
library caller) we will not be able to catch them. To do so we
SEIZE the root task in advance thus causing all SIGCHLD-s go to
criu, not to its parent.

Having this done we no longer need the SUBREAPER trick in the
library call -- tasks get restored right as callers kids :)

Some thoughts for future -- using this trick we can finally make
"natural" restoration of shell jobs. I.e. -- make criu restore
some subtree right under bash, w/o leaving itself as intermediate
task and w/o re-parenting the subtree to init after restore.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>
2014-07-01 16:16:07 +04:00
Pavel Emelyanov
5e9c57a13d criu: Dump and restore pdeath_sig value
The implementation is pretty straightforward. When dumping per-thread
misc data with parasite, collect one, then write in thread_core_info.

On restore wait for creds restore and put the value back (some creds
changes drop it to zero).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2014-07-01 16:16:04 +04:00
Pavel Emelyanov
d30521a3cf crtools: Add internal "swrk" action
To help restoring tasks from images as kids to the caller, we can
do the trick.

1. Caller sets himself as child reaper with PR_SET_CHILD_SUBREAPER prctl
2. Caller makes sure criu binary is suid-ed and owned by root
3. Caller forks and calls execv() on criu asking it to restore
4. Criu finishes restore and exits. All its kids get reparented to the
   criu's parent, i.e. -- to the library caller.
5. Caller stops being subreaper

In order to make the execv() and arguments passing simpler I propose
to execv() the service worker function, that accepts options via socket.

This is good for two reasons.

1. We don't have to construct CLI options in libcriu
2. We reuse other service's facilities, such as security checks,
   ability to dump, pre-dump and other stuff

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-27 14:24:33 +04:00
Pavel Emelyanov
fac7befa6b files: Sanity check for reg file on restore is not corrupted
When opening a reg file on restore -- check that the file size we
opened matches the on we saw on dump. This is not bullet-proof protection,
but is helpful to protect against FS updates between dump/restore.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 23:38:48 +04:00
Cyrill Gorcunov
fe7b8aeb8c vdso: x86 -- Add handling of vvar zones
New kernel 3.16 will have old vDSO zone splitted into the two vmas:
one for vdso code itself and second that named vvar for data been
referenced from vdso code.

Because I can't do 'dump' and 'restore' parts of the code separately
(otherwise test would fail) the commit is pretty big one and hard to
read so here is detailed explanation what's going on.

 1) When start dumping we detect vvar zone by reading /proc/pid/smap
    and looking up for "[vvar]" token. Note the vvar zone is mapped
    by a kernel with PF/IO flags so we should not fail here.

    Also it's assumed that at least for now kernel won't be changed
    much and [vvar] zone always follows the [vdso] zone, otherwise
    criu will print error.

 2) In previous commits we disabled dumping vvar area contents so
    the restorer code never try to read vvar data but still we need
    to map vvar zone thus vma entry remains in image.

 3) As with previous vdso format we might have 2 cases

    a) Dump and restore is happening on same kernel
    b) Dump and restore are done on different kernels

    To detect which case we have we parse vdso data from image
    and find symbols offsets then compare their values with runtime
    symbols provided us by a kernel. If they match and (!!!) the
    size of vvar zone is the same -- we simply remap both zones
    from runtime kernel into the positions dumpee had at checkpoint
    time. This is that named "inplace" remap (a).

    If this happens the vdso_proxify() routine drops VMA_AREA_REGULAR
    from vvar area provided by a caller code and restorer won't try
    to handle this vma. It looks somehow strange and probably should
    be reworked but for now I left it as is to minimize the patch.

    In case of (b) we need to generate a proxy. We do that in same
    way as we were before just include vvar zone into proxy and save
    vvar proxy address inside vdso mark injected into vdso area. Thus
    on subsequent checkpoint we can detect proxy vvar zone and rip
    it off the list of vmas to handle.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 22:48:43 +04:00
Cyrill Gorcunov
154d1c6c2c vdso: parasite -- Prepare new vdso mark structure.
Because of new vvar area we need to carry the
address of vvar proxy inside the mark. Thus
add members needed and update routines.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 22:48:43 +04:00
Cyrill Gorcunov
72ead490e4 vdso: image -- Add VMA_AREA_VVAR flag
Will need it to handle vvar zones in a special way.

Because VMA_UNSUPP never goes into the image file
lets reuse bit 12 for VVAR.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 22:48:40 +04:00
Pavel Emelyanov
3b995f1aef iov: Add iovec2pagemap() helper
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-20 16:35:52 +04:00
Andrey Vagin
494c044384 mount: dump one file system only once (v2)
A file system can be bind-mounted a few times and some of these mounts
can be non-root. We need to find one of root mounts and dump it.

v2: don't forget to check pm->dumped and pm->parent
    don't dump a root file system, it's always external for now.

Reported-by: Saied Kazemi <saied@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-17 10:40:00 +04:00
Andrey Vagin
697211908a tmpfs: use device number instead of mnt_id in image names
One file system can be mounted a few times, so mnt_id isn't unique for it.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-17 10:39:52 +04:00
Pavel Emelyanov
c7e0042946 crtools: Introduce the --ext-mount-map option (v3)
On dump one uses one or more --ext-mount-map option with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.

On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace the should be bind mounted into the mountpoint.

v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-17 10:36:30 +04:00