mirror of https://github.com/checkpoint-restore/criu synced 2025-08-30 13:58:34 +00:00
514 Commits

Author SHA1 Message Date
Pavel Emelyanov
1a2e6cbd3f dump: Don't close pid-proc in vain
The open_pid_proc engine knows itself how to cache
per-pid descriptors. No need to close it by hand.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:22:21 +04:00
Pavel Emelyanov
e651a6eba4 filemap: Get vma mnt_id early
We have a, well, issue with how we calculate the vma's mnt_id.

Right now we get one via the criu-side file descriptor obtained by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.

So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.

I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openings.

Note, that in case app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:20:55 +04:00
Pavel Emelyanov
cf8c9ae870 vma: Reshuffle the struct vma_area
We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).

Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:19:55 +04:00
Pavel Emelyanov
1ebd56b024 proc: Don't use FILE * to reach children
The same reasoning as for personality file -- switch to
plain open + read + close.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:39:56 +04:00
Pavel Emelyanov
f3bee6d584 proc: Don't use FILE* for reading personality
It turned out, that fdopen (used in fopen_proc) always maps
a 4k buffer for reads and this buffer gets unmap-ed later
on fclose.

Taking into account the amount of proc files we read (~20
per task plus one file per opened file descriptor) this
mmap+munmap result in quite a lot of useless CPU time.

E.g. for a container of 20 tasks we have 1000 calls taking
~8% of total dump time.

So let's first stop doing this for simple cases -- one line
proc files.
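
A minimal sketch of the plain open + read + close pattern for a
one-line file. read_one_line() is a hypothetical helper written for
illustration, not the function this patch adds to criu.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the first line of a small file into buf (NUL-terminated, newline
 * stripped) without stdio, so no FILE buffer gets mapped and unmapped.
 * Returns the line length, or -1 on error.
 */
ssize_t read_one_line(const char *path, char *buf, size_t size)
{
	ssize_t n;
	int fd;

	if (size == 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Assumption: a one-line proc file fits into a single read() */
	n = read(fd, buf, size - 1);
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	return (ssize_t)strlen(buf);
}
```

A caller would pass a path such as /proc/<pid>/personality and a small
on-stack buffer.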

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:39:49 +04:00
Pavel Emelyanov
17d44de9af scripts: Use numeric script names
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-05 13:48:26 +04:00
Pavel Emelyanov
069bdd9674 scripts: Move scripts code into separate sources
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-05 13:48:21 +04:00
Cyrill Gorcunov
3146f58317 plugin: Rework plugins API, v2
Here we define new api to be used in plugins.

 - Plugin should provide a descriptor with help of
   CR_PLUGIN_REGISTER macro, or in case if plugin require
   no init/exit functions -- with CR_PLUGIN_REGISTER_DUMMY.

 - Plugin should define a plugin hook with help of
   CR_PLUGIN_REGISTER_HOOK macro.

 - Now init/exit functions of plugins take a @stage
   argument which tells the plugin at which stage of criu
   (dump or restore) it is being called. The exit function
   also takes @ret, which lets the plugin know whether
   something went wrong and it needs to clean up its
   own resources.

The idea is not to restrict plugin authors in how they name
the functions they use for a particular hook.

This new API deprecates the old plugin structure, but to keep
backward compatibility we will provide a tiny layer of
additional code to support old plugins for at least a couple
of release cycles.

For example a trivial plugin might look like

 | #include <sys/types.h>
 | #include <sys/stat.h>
 | #include <fcntl.h>
 | #include <libgen.h>
 | #include <errno.h>
 |
 | #include <sys/socket.h>
 | #include <linux/un.h>
 |
 | #include <stdio.h>
 | #include <stdlib.h>
 | #include <string.h>
 | #include <unistd.h>
 |
 | #include "criu-plugin.h"
 | #include "criu-log.h"
 |
 | static int dump_ext_file(int fd, int id)
 | {
 |	pr_info("dump_ext_file: fd %d id %d\n", fd, id);
 |	return 0;
 | }
 |
 | CR_PLUGIN_REGISTER_DUMMY("trivial")
 | CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__DUMP_EXT_FILE, dump_ext_file)

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-03 20:48:36 +04:00
Pavel Emelyanov
57c7826a8e locks: Check for --file-locks option when real locks are found
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 20:20:47 +04:00
Pavel Emelyanov
d58aafc447 dump: Don't allocate dfds in case we dump shared fdtable
After the patches that dump locks w/o the dfds array, we can
skip allocating one when we don't need it.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 17:45:29 +04:00
Pavel Emelyanov
53537f52c8 locks: Don't dump locks in per-task manner (v3)
We have a problem with file locks (bug #2512) -- the /proc/locks
file shows the ID of lock creator, not the owner. Thus, if the
creator died, but holder is still alive, criu fails to dump the
lock held by latter task.

The proposal is to find who _might_ hold the lock by checking
for dev:inode pairs on lock vs file descriptors being dumped.
If the creator of the lock is still alive, then he will take
the priority.

One thing to note about flocks -- these belong to file entries,
not to tasks. Thus, when we meet one, we should check whether
the flock is really held by task's FD by trying to set yet
another one. In case of success -- lock really belongs to fd
we dump, in case it doesn't trylock should fail.

At the very end -- walk the list of locks and dump them all at
once, which is possible by merge of per-task file-locks images
into one global one.
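
The flock ownership check described above can be sketched roughly as
follows. fd_owns_flock() is a hypothetical name, and the sketch assumes
the caller already knows (e.g. from /proc/locks) that an exclusive
flock exists on the file; with shared locks a second LOCK_SH would
succeed even for a non-owner, so that case needs extra care.

```c
#include <errno.h>
#include <sys/file.h>

/*
 * Try to set the same lock again through fd, without blocking.
 * Returns 1 if the open file behind fd appears to own the lock
 * (re-locking succeeds), 0 if another open file owns it
 * (EWOULDBLOCK), and -1 on any other error.
 */
int fd_owns_flock(int fd, int operation)
{
	if (flock(fd, operation | LOCK_NB) == 0)
		return 1;

	return errno == EWOULDBLOCK ? 0 : -1;
}
```

Because flocks belong to the open file and not to the task, two
descriptors obtained by separate open() calls give different answers
even within one process.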

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-02 17:44:46 +04:00
Saied Kazemi
d8b41b6525 Added AUFS support.
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.

The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver.  For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.

[ xemul: With AUFS files sometimes, in particular -- in case of a
  mapping of an executable file (likely the one created at elf load),
  in the /proc/pid/map_files/xxx link target we see not the path
  by which the file is seen in AUFS, but the path by which AUFS
  accesses this file from one of its "branches". In order to fix
  the path we get the info about branches from sysfs and when we
  meet such a file, we cut the branch part of the path. ]

Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-21 18:35:22 +04:00
Pavel Emelyanov
546f2701f0 signals: Comments and while (1) loop
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:27:54 +04:00
Pavel Emelyanov
11fc475853 signals: Sanitize j loop control variable
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:27:40 +04:00
Pavel Emelyanov
f9ebd18354 signals: Don't collect siginfo_t-s on stack
We've moved siginfos onto the core entry, thus the bits with
siginfo-s themselves cannot sit on the stack any longer.
Otherwise we would overwrite them with the next batch and
would feed a stack pointer to the caller, thus causing
garbage from the stack to be written into the image
instead of siginfo data.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:27:19 +04:00
Pavel Emelyanov
92664c5220 signals: Don't forget to allocate SiginfoEntry
The se variable is just an array of pointers on these
objects. Need to allocate the objects themselves.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:25:57 +04:00
Pavel Emelyanov
8197bae072 signals: Move nr variable into peeking loop
And sanitize its usage a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:25:13 +04:00
Pavel Emelyanov
22082b0e55 signals: Calculate peek offset in-place
No need in extra variable for that.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 15:24:36 +04:00
Ruslan Kuprieiev
68501cde88 dump: dump signals into signals_*
Every thread has its own private signals stored at thread_core->signals_p
and the leader thread also has shared signals stored at tc->signals_s.

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 13:09:47 +04:00
Ruslan Kuprieiev
aac9fd5bad dump: allocate task cores in collect_task() instead of parasite_infect_seized()
We need it to be able to dump signals into cores
before calling parasite_infect_seized().

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-19 13:09:46 +04:00
Sophie Blee-Goldman
e606c2141e Dump capabilities from the parasite
Needed for future user namespace support. Capabilities will have to be
dumped from the parasite, ie from inside the namespace since there is no
obvious way to 'translate' capabilities from the global namespace (unlike
with uids and gids, where the id mappings can be used for translation).

[ additional explanation from Andrew Vagin:

"capabilities" are not translated between namespaces. They can exist
only in one userns, where a process lives. If a process is created in a
new userns, it gets a full set of capabilities in this userns, and
loses all caps in a parent userns.

So if capabilities are not shown in /proc/pid/stat, we have no way to
get them except by using parasite code. ]

Signed-off-by: Sophie Blee-Goldman <ableegoldman@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-15 23:10:44 +04:00
Andrey Vagin
e4e22a00f7 mount: save remapped links on tmpfs (v2)
For that mnt namespaces should be dumped after files.

v2: rework enumeration of namespaces in dump_mnt_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-08-05 16:35:41 +04:00
Tycho Andersen
51876eea5d Attempt to restore cgroups
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.

On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
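
As an illustration of the per-task parsing step, here is a sketch that
splits one line of a /proc/<pid>/cgroup file, whose format is
hierarchy-ID:controller-list:path. struct cg_line and
parse_cgroup_line() are hypothetical names, not the ones used in criu,
and the buffer sizes are arbitrary.

```c
#include <stdio.h>
#include <string.h>

struct cg_line {
	int hierarchy;		/* hierarchy ID */
	char controllers[64];	/* comma-separated controller list */
	char path[256];		/* cgroup path inside the hierarchy */
};

/* Returns 0 on success, -1 on a malformed or oversized line */
int parse_cgroup_line(const char *line, struct cg_line *out)
{
	const char *c1, *c2;
	size_t len;

	if (sscanf(line, "%d:", &out->hierarchy) != 1)
		return -1;

	c1 = strchr(line, ':');
	if (!c1)
		return -1;
	c2 = strchr(c1 + 1, ':');
	if (!c2)
		return -1;

	len = (size_t)(c2 - (c1 + 1));
	if (len >= sizeof(out->controllers))
		return -1;
	memcpy(out->controllers, c1 + 1, len);
	out->controllers[len] = '\0';

	/* The path runs up to the end of the line */
	len = strcspn(c2 + 1, "\n");
	if (len >= sizeof(out->path))
		return -1;
	memcpy(out->path, c2 + 1, len);
	out->path[len] = '\0';

	return 0;
}
```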

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-10 17:00:28 +04:00
Pavel Emelyanov
5e9c57a13d criu: Dump and restore pdeath_sig value
The implementation is pretty straightforward. When dumping per-thread
misc data with parasite, collect one, then write in thread_core_info.

On restore wait for creds restore and put the value back (some creds
changes drop it to zero).
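
A sketch of the two prctl() calls this relies on. The wrapper names
are hypothetical; in criu the "get" side runs inside the parasite and
the "set" side runs in the restorer after creds are in place.

```c
#include <signal.h>
#include <sys/prctl.h>

/* Dump side: fetch the parent-death signal of the calling thread */
int get_pdeath_sig(int *sig)
{
	return prctl(PR_GET_PDEATHSIG, sig);
}

/*
 * Restore side: put the value back. Call this after credentials
 * are restored, since some creds changes reset it to zero.
 */
int set_pdeath_sig(int sig)
{
	return prctl(PR_SET_PDEATHSIG, (unsigned long)sig);
}
```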

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2014-07-01 16:16:04 +04:00
Filipe Brandenburger
64dc66c29f dump: do not fail dump when robust_lists are disabled
Robust lists may be disabled, for example if the "futex_cmpxchg_enabled"
variable in the kernel is unset.

Detect that case by checking that both "get_robust_list" and "set_robust_list"
syscalls return ENOSYS and do not make criu dump fail in that case, but simply
assume an empty list, which is consistent with the syscalls not being
available.
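
The detection can be sketched like this. Both function names are
hypothetical, and only the get_robust_list probe is shown; the actual
patch also looks at set_robust_list.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 if get_robust_list works for the caller, else the errno */
int probe_get_robust_list(void)
{
	void *head = NULL;
	size_t len = 0;

	if (syscall(SYS_get_robust_list, 0, &head, &len) == 0)
		return 0;
	return errno;
}

/*
 * Robust lists are treated as disabled only when both syscalls
 * report ENOSYS; the dump then assumes an empty list.
 */
int robust_lists_disabled(int get_err, int set_err)
{
	return get_err == ENOSYS && set_err == ENOSYS;
}
```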

Tested: Successfully ran the zdtm test suite on a kernel where the
"get_robust_list" and "set_robust_list" syscalls are disabled.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-25 19:57:32 +04:00
Cyrill Gorcunov
0bb002ce69 vdso: dump -- Don't dump contents of vvar zone
The vvar zone is mapped by the kernel and must never
be dumped into the image; the data present there is
valid on the running kernel only.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 22:48:41 +04:00
Pavel Emelyanov
1ba9d2cae9 cg: Dump cgroups tasks live in
Each task points to a single ID of cgroup-set it lives in. This
is done so to save some space in the image, as tasks likely
live in the same set of cgroups.

Other than this we keep track of what cgroup set we dump the
subtree from. If it happens, that root task lives in the same
cgroup set as criu does, we don't allow for any other sub-cgroups
and make restore (next patch) much simpler and faster.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:48:06 +04:00
Pavel Emelyanov
8b8eb53a0a cg: Skeleton for cgroup code
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:48:06 +04:00
Cyrill Gorcunov
89faae1e9b vdso: dump -- Drop duplicated include
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:40:00 +04:00
Filipe Brandenburger
d5bb7e9748 dump: preserve the dumpable flag on criu dump/restore
Preserve the dumpable flag, which affects whether a core dump will be
generated, but also affects the ownership of the virtual files under
/proc/$pid after restoring a process.
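
For reference, a sketch of reading and setting the flag with prctl().
The wrapper names are hypothetical. Note that PR_SET_DUMPABLE accepts
only 0 or 1, so a task whose flag reads back as 2 cannot be restored
with a plain set call.

```c
#include <sys/prctl.h>

/* Returns the current dumpable value (0, 1 or 2), or -1 on error */
int get_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}

/* Sets the dumpable flag; the kernel rejects values other than 0 and 1 */
int set_dumpable(int value)
{
	return prctl(PR_SET_DUMPABLE, (unsigned long)value, 0UL, 0UL, 0UL);
}
```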

Tested: Restored a process with a criu including this patch and looked
at /proc/$pid to confirm that the virtual files were no longer all owned
by root:root.

zdtm tests pass except for cow01 which seems to be broken.
(see https://bugzilla.openvz.org/show_bug.cgi?id=2967 for details.)

This patch fixes https://bugzilla.openvz.org/show_bug.cgi?id=2968

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Change-Id: I8c386508448a84368a86666f2d7500b252a78bbf
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-14 01:02:37 +04:00
Andrey Vagin
2f4be997b6 mount: use per-namespace mntinfo_tree (v2)
This patch removes the global mntinfo_tree and collect_mount_info where
it was constructed. The mntinfo list is filled from dump_mnt_ns,
rst_collect_local_mntns, collect_mnt_namespaces and read_mnt_ns_img.

A mountinfo entry contains a reference on a proper ns_id entry, so
we can use mnt_id to look up a proper mount namespace.

v2: remove trash after rebasing.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:40:19 +04:00
Andrey Vagin
b6d3314c54 check: collect mounts of the current mntns
They are used for collecting unix sockets

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:40:04 +04:00
Andrey Vagin
e827a695f3 mount: separate collect_mnt_ns from dump_mnt_ns
We are going to support nested mntns, so the global mntinfo_tree
variable is useless and information about the tree should be connected
to a proper namespace.

But when we don't dump mntns, we need to collect mounts for the current
mntns.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:41 +04:00
Andrey Vagin
de4326a382 mount: return descriptor from mntns_collect_root
We are going to support nested mount namespaces, so files can be opened
from more than one namespace and a root must be collected for each file.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:32 +04:00
Andrey Vagin
22d384536d files-ids: generate id-s according to mnt_id, st->st_dev and st->st_ino
One device can be mounted a few times, so files are identical only,
if they have the same mnt_id.
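
In other words, the lookup key grows from two fields to three. A
sketch of such a key and its ordering; struct file_key and
file_key_cmp() are hypothetical names, not the ones used in
files-ids.

```c
#include <sys/types.h>

struct file_key {
	int mnt_id;	/* mount the file was opened through */
	dev_t dev;	/* st->st_dev */
	ino_t ino;	/* st->st_ino */
};

/* Three-way compare: two files match only if all three fields match */
int file_key_cmp(const struct file_key *a, const struct file_key *b)
{
	if (a->mnt_id != b->mnt_id)
		return a->mnt_id < b->mnt_id ? -1 : 1;
	if (a->dev != b->dev)
		return a->dev < b->dev ? -1 : 1;
	if (a->ino != b->ino)
		return a->ino < b->ino ? -1 : 1;
	return 0;
}
```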

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:28 +04:00
Andrey Vagin
87b1f5408c files: save mnt_id on fd_param
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:18 +04:00
Andrey Vagin
0721626902 namespaces: dump mount namespaces before tasks (v2)
because we want to check, that all files are reachable.
For that we need to collect all mounts from all namespaces.

v2: dump mntns separately
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:38:47 +04:00
Andrey Vagin
d2012883ab criu: rename current_ns_mask to root_ns_mask (v2)
Now we support sub-mntns, so root_ns_mask sounds more correct than
current_ns_mask.

v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:38:33 +04:00
Pavel Emelyanov
1d438db66d rlimits: Move entries from top-core into task-core
This appeared after latest 1.2, so it's still possible
to do this move.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:01:08 +04:00
Pavel Emelyanov
b54e340945 core: Move posix timers on core entry
This as well gives us minus one image per-task and
allocates more space on core task entry.

One thing to note -- the amount of posix timers is
not easily accessible at the core entry allocation
time, so the respective array is allocated on demand.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:00:54 +04:00
Pavel Emelyanov
dfd5a62f38 core: Move itimers on core
This allows to have one image less per-task, which in turn
reduces live migration time a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:00:52 +04:00
Christopher Covington
c1cd6b5e5f Allow dumps of stopped multithreaded processes
CRIU can handle stopped multithreaded processes when all threads
are stopped. Refine the check to allow this case.

Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-11 15:16:18 +04:00
Andrey Vagin
4e1d81deb6 cr-dump: allocate dfds near the place where it's used
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-08 22:55:57 +04:00
Jamie Liu
efe594f8f4 criu: fix filemap open permissions
An mmaped file is opened O_RDONLY or O_RDWR depending on the permissions
on the first vma dump_task_mm() encounters mapping that file. This
causes two problems:

1. If a file has multiple MAP_SHARED mappings, some of which are
   read-only and some of which are read-write, and the first encountered
   mapping happens to be read-only, the file will be opened O_RDONLY
   during restore, and mmap(PROT_WRITE) will fail with EACCES, causing
   the restore to fail.

2. If a file is opened read-write and mapped read-only, it will be
   opened O_RDONLY during restore, so restore will succeed, but
   mprotect(PROT_WRITE) on the read-only mapping after restore will
   fail.

To fix both of these, record open flags per-vma based on the presence of
VM_MAYWRITE in smaps.
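
A sketch of the per-vma decision, assuming the flag is taken from the
"VmFlags:" line of smaps, where VM_MAYWRITE shows up as the two-letter
code "mw". vma_open_flags() is a hypothetical helper, not the code in
this patch.

```c
#include <fcntl.h>
#include <string.h>

/*
 * Pick open flags for the file backing a vma from its VmFlags line.
 * Returns O_RDWR if "mw" (may write) is present, O_RDONLY if not,
 * and -1 if the line has no ':' separator.
 */
int vma_open_flags(const char *vmflags_line)
{
	const char *p = strchr(vmflags_line, ':');

	if (!p)
		return -1;
	p++;

	while (*p) {
		while (*p == ' ')
			p++;
		/* Match "mw" as a whole token, not as part of another one */
		if (p[0] == 'm' && p[1] == 'w' &&
		    (p[2] == ' ' || p[2] == '\n' || p[2] == '\0'))
			return O_RDWR;
		while (*p && *p != ' ')
			p++;
	}

	return O_RDONLY;
}
```

With this, a file that is mapped read-only but was opened read-write
still gets O_RDWR on restore, so a later mprotect(PROT_WRITE) keeps
working.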

Signed-off-by: Jamie Liu <jamieliu@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-04 20:35:48 +04:00
Cyrill Gorcunov
5f433a6e81 pstree: Define RLIM_NLIMITS
On PI machine we've got

 |   CC	   protobuf.o
 | pstree.c: In function ‘core_entry_alloc’:
 | pstree.c:36:10: error: ‘RLIM_NLIMITS’ undeclared (first use in this function)

due to old kernel headers. Note I've dropped off
BUG_ON here to localize all things in pstree code,
no need to sprinkle constants.

Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-14 18:57:16 +04:00
Pavel Emelyanov
57825f6500 rlims: Unscrew up core->rlimits[i] assignment
The array element is RlimitEntry properly initialized,
no need in additional memcpy-s and size-checks.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-14 15:46:42 +04:00
Cyrill Gorcunov
79a88ae0dd rlimit: Stop writing old rlimits image entries
We're using new image format, but old image file
is still generated. This will be addressed in
next patch.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-14 15:44:46 +04:00
Cyrill Gorcunov
fd82384866 rlimit: Dump task rlimits into Core entry
Note the restore remains as is for a while, it'll
be addressed later.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-14 15:44:34 +04:00
Cyrill Gorcunov
ac03ca5599 auxv: Restore backward compatibility
In commit 459828b6 I accidentally broke backward
compatibility of auxv vector on 32bit machines.
Bring it back.

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-11 09:18:07 +04:00
Cyrill Gorcunov
459828b6be dump: Read aux vector in one pass
No need to read it in cycle.

Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-10 14:26:29 +04:00