2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-31 06:15:24 +00:00
Commit Graph

594 Commits

Author SHA1 Message Date
Pavel Emelyanov
b8d01d1b7a files: Rename prepare_fs into restore_fs
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-04 15:09:02 +04:00
Pavel Emelyanov
b429492dbc rst: Include criu/include/ptrace.h instead of system one
On ARM some PTRACE_... constants are not declared in sys/ptrace.h file.
They are in linux/ptrace.h, but on x86 this file somewhat conflicts with
the sys/ one. For now fix ARM compilation by using criu/ one and think
of it later.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-07-01 19:48:23 +04:00
Pavel Emelyanov
84eb0a1927 criu: Restore tasks as siblings in swrk
Andrey validly pointed out, that restoring pdeath_sig is not
compatible with criu_restore_child() call -- after criu restore
children, it will exit and fire the pdeath_sig into restored
tree root, potentially killing it.

The fix for that could be -- when started in swrk more, criu can
restore tree not as children tasks, but as siblings, using the
CLONE_PARENT flag when fork()-ing the root task.

With this we should also take care about errors handing -- right
now criu catches the SIGCHILD from dying children tasks, and
since we plan to create them be children of the criu parent (the
library caller) we will not be able to catch them. To do so we
SEIZE the root task in advance thus causing all SIGCHLD-s go to
criu, not to its parent.

Having this done we no longer need the SUBREAPER trick in the
library call -- tasks get restored right as callers kids :)

Some thoughts for future -- using this trick we can finally make
"natural" restoration of shell jobs. I.e. -- make criu restore
some subtree right under bash, w/o leaving itself as intermediate
task and w/o re-parenting the subtree to init after restore.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>
2014-07-01 16:16:07 +04:00
Pavel Emelyanov
5e9c57a13d criu: Dump and restore pdeath_sig value
The implementation is pretty straightforward. When dumping per-thread
misc data with parasite, collect one, then write in thread_core_info.

On restore wait for creds restore and put the value back (some creds
changes drop it to zero).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2014-07-01 16:16:04 +04:00
Cyrill Gorcunov
fe7b8aeb8c vdso: x86 -- Add handling of vvar zones
New kernel 3.16 will have old vDSO zone splitted into the two vmas:
one for vdso code itself and second that named vvar for data been
referenced from vdso code.

Because I can't do 'dump' and 'restore' parts of the code separately
(otherwise test would fail) the commit is pretty big one and hard to
read so here is detailed explanation what's going on.

 1) When start dumping we detect vvar zone by reading /proc/pid/smap
    and looking up for "[vvar]" token. Note the vvar zone is mapped
    by a kernel with PF/IO flags so we should not fail here.

    Also it's assumed that at least for now kernel won't be changed
    much and [vvar] zone always follows the [vdso] zone, otherwise
    criu will print error.

 2) In previous commits we disabled dumping vvar area contents so
    the restorer code never try to read vvar data but still we need
    to map vvar zone thus vma entry remains in image.

 3) As with previous vdso format we might have 2 cases

    a) Dump and restore is happening on same kernel
    b) Dump and restore are done on different kernels

    To detect which case we have we parse vdso data from image
    and find symbols offsets then compare their values with runtime
    symbols provided us by a kernel. If they match and (!!!) the
    size of vvar zone is the same -- we simply remap both zones
    from runtime kernel into the positions dumpee had at checkpoint
    time. This is that named "inplace" remap (a).

    If this happens the vdso_proxify() routine drops VMA_AREA_REGULAR
    from vvar area provided by a caller code and restorer won't try
    to handle this vma. It looks somehow strange and probably should
    be reworked but for now I left it as is to minimize the patch.

    In case of (b) we need to generate a proxy. We do that in same
    way as we were before just include vvar zone into proxy and save
    vvar proxy address inside vdso mark injected into vdso area. Thus
    on subsequent checkpoint we can detect proxy vvar zone and rip
    it off the list of vmas to handle.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-06-24 22:48:43 +04:00
Pavel Emelyanov
3659d60ab7 restore: Open /proc/sys/kernel/ns_last_pid via helpers
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-06-09 15:29:49 +04:00
Pavel Emelyanov
203c291467 cg: Restore tasks into proper cgroups
On restore find out in which sets tasks live in and move
them there.

Optimization note -- move tasks into cgroups _before_ fork
kids to make them inherit cgroups if required. This saves
a lot of time.

Accessibility note -- when moving tasks into cgroups don't
search for existing host mounts (they may be not available)
and don't mount temporary ones (may be impossible due to
user namespaces). Instead introduce service fd with a yard
of mounts.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:48:06 +04:00
Pavel Emelyanov
8b8eb53a0a cg: Skeleton for cgroup code
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:48:06 +04:00
Cyrill Gorcunov
676708e3b3 vdso: Put CONFIG_VDSO where needed
Guard vDSO code with CONFIG_VDSO, no need to even build it
on archs which do not support vDSO handling.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-27 23:40:07 +04:00
Andrey Vagin
1a1b50168d vma: don't skip vmas during searching a parent vma
We stop searching if vma->start is bigger than a required one.
The coursor is set on the last examined vma. When we are searching a
parent vma for the next vma, we start examine vma-s starting from
coursor->next, so we don't examine the vma, which is pointed by cursor.

This patch replaces list_for_each_entry_continue on list_for_each_entry_from.

Reported-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-14 01:00:31 +04:00
Andrey Vagin
c838494034 mem: stop searching a parent vma if we found one
A parent vma can be only one.

Fixes: 57d25e7cea ("mm: fix expression to determine which vma-s can be shared")
Reported-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-14 01:00:22 +04:00
Andrey Vagin
b57497ade5 mem: fix typo in determining an address of parent vma
Look at this hunk from 7659c995f5:
-    paddr = decode_pointer(vma_premmaped_start(&p->vma));
+    paddr = decode_pointer(vma->premmaped_addr);

Obviously we want to use p->premmaped_addr instead of
vma->premmaped_addr.

Fixes: 7659c995f5 ("vm: don't overwrite vma->shmid for private mappings")
Reported-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-05-14 01:00:13 +04:00
Andrey Vagin
20003300d8 mount: remove extra calls of mntns_collect_root()
Now mntns_collect_root() should be called each time when we need to get
a root of a specified namespace and we don't need to call it for
initializing the global variable.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-22 23:49:43 +04:00
Pavel Emelyanov
79f3e90856 rst: Less arguments to restore_task_mnt_ns
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-22 23:48:46 +04:00
Pavel Emelyanov
8550f52017 mnt: Move local mntns collecting on restore into prepare_mnt_ns
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-22 23:48:43 +04:00
Andrey Vagin
2f4be997b6 mount: use per-namespace mntinfo_tree (v2)
This patch removes the global mntinfo_tree and collect_mount_info where
it was constructed. The mntinfo list is filled from dump_mnt_ns,
rst_collect_local_mntns, collect_mnt_namespaces and read_mnt_ns_img.

A mountinfo entry contains a reference on a proper ns_id entry, so
we cau use mnt_id to look up a proper mount namespace.

v2: remove trash after rebasing.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:40:19 +04:00
Andrey Vagin
5418938ec3 resotre: collect mounts of current mntns
It's required for restoring in the current mntns.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:46 +04:00
Andrey Vagin
de4326a382 mount: return descriptor from mntns_collect_root
We are going to support nested mount namespaces, so files can be opened
from more than one namespace and a root must be collect for each file.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:32 +04:00
Andrey Vagin
36a9f31665 restore: close PROC_FD_OFF before calling sigreturn
mntns_collect_root() uses PROC_FD_OFF, so we need to close it.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:39:30 +04:00
Andrey Vagin
d2012883ab criu: rename current_ns_mask to root_ns_mask (v2)
Now we supports sub-mntns, so root_ns_mask sounds more correct than
current_ns_mask.

v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:38:33 +04:00
Andrey Vagin
3a291e33ff crtools: restore nested mount namespaces (v2)
Known issue:
* currently only namespaces with the same root is supported
* nested namespaces can be dumped and restored only if the root task
  has own mount namespace.

All nested namespaces are restored in a root namespace in temporary
directories. All mount points restored in one tree and then they are
divided into namesaces.
The task with minimal pid for each namespaces unshared mntns and
then it makes pivot_root in a proper temporary directory. All other
tasks makes setns to enter into a mount namespace of the task with
minimal pid.

v2: clean up

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:38:17 +04:00
Andrey Vagin
e7e9c2ee6e mounts: create a temporary directory for restoring non-root mntns (v2)
All non-root namespaces will be restored as sub-trees of the root tree.

This patch adds helpers to create a temporary directory and mount tmpfs
in it, then create directories for each non-root mount namespace.

tmpfs is quite useful here to simplify destroying this construction,
we don't need to unmount each namespace separately.

v2: add a comment why MNT_DETACH is not dangerous here
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-21 22:38:12 +04:00
Pavel Emelyanov
e8ac085af8 Revert "crtools: close all desriptors only for the root task"
We have a race. Consider we have 3 tasks, A, B and C. A and B
share fdtable, C -- does not. Then we might be in a situation
when A is restoring memory reading mem images, and B -- forking
the C child. In that case descriptors held by A (for mem restore)
will be inherited by C and will not get closed.

This reverts commit d36e07aabe.
2014-04-21 14:48:05 +04:00
Andrey Vagin
946eadd598 mount: open_mount uses __open_mountpoint instead of own logic
Now we have two funсtions which do mostly the same, so this patch merges
them.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:03:15 +04:00
Pavel Emelyanov
302591aa05 rlimits: Reshuffle new and legacy restoration code
Do the same as was done with timers.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:01:10 +04:00
Pavel Emelyanov
1d438db66d rlimits: Move entries from top-core into task-core
This appeared after latest 1.2, so it's still possible
to do this move.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:01:08 +04:00
Pavel Emelyanov
35e560de00 timers: Reshuffle new and legacy restoration code
Make explicit checks and helpers for legacy images.
This should facilitate its removal some day in the
future.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:01:06 +04:00
Pavel Emelyanov
87da9b83ce posix-timers: Clean restore code flow
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:01:04 +04:00
Pavel Emelyanov
b54e340945 core: Move posix timers on core entry
This as well gives us minus one image per-task and
allocates more space on core task entry.

One thing to note -- the amount of posix timers is
not easily accessible at the core entry allocation
time, so the respective array is allocated on demand.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:00:54 +04:00
Pavel Emelyanov
dfd5a62f38 core: Move itimers on core
This allows to have one image less per-task, which in turn
reduces live migration time a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-17 12:00:52 +04:00
Andrey Vagin
70d9780ddd restore: call close_old_fds() after mounting /proc
A process can be restored in a new pidns. close_old_fds() opens
the /proc/PID directory. Without this patch we can see errors like this:
(00.333915)      1: Error (util.c:102): Unable to close fd 6: Bad file descriptor

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-11 16:50:16 +04:00
Andrey Vagin
d36e07aabe crtools: close all desriptors only for the root task
For all other tasks only unsed service descriptors will be closed.

This change allows to have file descriptors, which may be used for
restoring namespaces. All non-server descriptors must be closed before
restoring files.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-09 15:50:40 +04:00
Jamie Liu
288cf51741 restore: mutate tgt_addr in map_private_vma
prepare_mappings() uses the return value of map_private_vma() for the
size of the mapped vma. Unfortunately the return value of
map_private_vma() is an int, resulting in breakage when the size exceeds
31 bits. Change map_private_vma() to return only an error code, and
mutate addr in-place.

Signed-off-by: Jamie Liu <jamieliu@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-02 15:52:49 +04:00
Andrey Vagin
a7fb6a1f41 restore: call post-restore scripts before network-unlock
post-restore script can fail, so it can't be called after network-unlock.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-04-01 11:21:36 +04:00
Deyan Doychev
69a6bf4439 criu: Add exec-cmd option (v3)
The --exec-cmd option specifies a command that will be execvp()-ed on successful
restore. This way the command specified here will become the parent process of
the restored process tree.

Waiting for the restored processes to finish is responsibility of this command.

All service FDs are closed before we call execvp(). Standad output and error of
the command are redirected to the log file when we are restoring through the RPC
service.

This option will be used when restoring LinuX Containers and it seems helpful
for perf or other use cases when restored processes must be supervised by a
parent.

Two directions were researched in order to integrate CRIU and LXC:

1. We tell to CRIU, that after restoring container is should execve()
   lxc properly explaining to it that there's a new container hanging
   around.

2. We make LXC set himself as child subreaper, then fork() criu and ask
   it to detach (-d) from restore container afterwards. Being a subreaper,
   it should get the container's init into his child list after it.

The main reason for choosing the first option is that the second one can't work
with the RPC service. If we call restore via the service then criu service will
be the top-most task in the hierarchy and will not be able to reparent the
restore trees to any other task in the system. Calling execve from service
worker sub-task (and daemonizing it) should solve this.

Signed-off-by: Deyan Doychev <deyandoichev@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-25 01:20:02 +04:00
Tikhomirov Pavel
670d1ce856 v2 page-read: rework open_page_read to use in shmem restore
Signed-off-by: Tikhomirov Pavel <snorcht@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-18 11:48:58 +04:00
Cyrill Gorcunov
1153f225ff image: Add O_OPT when trying to open optional image files
During the time some files become obsolete and might be missing
in checkpoint image set, but to keep backward compatibility we
still trying to open them, which might print out error like

 | Unable to open 'path-to-file'

and confuse a reader why criu prints error but continue working.

To eliminate this problem O_OPT flag has been introduced in
commit 16b5692061, which suppress error message priting
if the flag is set.

Now start using O_OPT in the following functions

 - open_irmap_cache: irmap cache is relatively new optional feature

 - prepare_rlimits, open_signal_image, restore_file_locks,
   prepare_fd_pid, prepare_mm_pid, collect_image: all these
   helpers are trying to open image files which can be missing.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-17 14:21:21 +04:00
Cyrill Gorcunov
b478b2fb2f rlimit: Restore rlimist from Core data
To save backward compatibility try to read
data from old image if Core entry doesn't
has rlimits bound.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-03-14 15:44:42 +04:00
Pavel Emelyanov
391e4bd7b9 page-read: Sanitize opening routines
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-28 15:19:19 +04:00
Tikhomirov Pavel
f0a6d32cd4 v2 deduplication: add auto-dedup on restore
if option --auto-dedup is set on restore, then as soon as page is
restored it will be punched from the image.

open image in O_RDWR mode

Signed-off-by: Tikhomirov Pavel <snorcht@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-28 14:11:49 +04:00
Pavel Emelyanov
dc7abdfb92 vma: Don't lookup file_desc for vma twice
We do it first -- on collect, second -- on restore. The
2nd lookup is excessive, we can put fd pointer on vm_area
at lookup and reuse one later.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-07 13:51:29 +04:00
Pavel Emelyanov
fd41201975 restore: Parse /proc/self/maps for self mappings
On restore we only need to know currnet task mappings' start and end
to find where to put the restorer blob. And since the smaps file in
/proc/pid is up to 3 times slower, than the maps one, it makes
perfect sense just to parse the latter one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-07 13:32:21 +04:00
Pavel Emelyanov
18a5c90c3b collect: Add comment describing collect order
With packed reg-files we have a complex fd - file - vma - remap interaction. I
think this should be reflected in the code comment.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-05 16:18:29 +04:00
Pavel Emelyanov
74c3cc1996 rst: Introduce post-restore action
Useful to test restore time -- just abort restore with this
action and that's it.

Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-04 20:54:51 +04:00
Pavel Emelyanov
d8071ffd1a stats: Fix restore pages stats
We errorneously report nr_compared as total number of restored pages.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-04 14:03:10 +04:00
Pavel Emelyanov
72e462ad67 mm: Read mmentry early
We'll merge mm and vma images, so mm should be read in the
same place where vmas are.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-04 11:44:04 +04:00
Pavel Emelyanov
eb1ae0a025 vma: Turn embeded VmaEntry on vma_area into pointer
On restore we will read all VmaEntries in one big MmEntry object,
so to avoif copying them all into vma_areas, make them be pointable.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-04 11:44:01 +04:00
Pavel Emelyanov
446fdd7200 rst: Collect VmaEntries only once on restore
Right now we do it two times -- on shmem prepare and
on the restore itself. Make collection only once as
we do for fdinfo-s -- root task reads all stuff in and
populates tasks' rst_info with it.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-03 23:35:03 +04:00
Pavel Emelyanov
0786f831d7 mem: Move shmem preparation routine and rename
We'll collect VmaEntries early before fork.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-03 23:34:12 +04:00
Pavel Emelyanov
c8d5f1a215 vma: Use vma_area_is helper where appropriate
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-02-03 17:22:03 +04:00