Currently CRIU traces syscalls to catch a moment, when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().
We can reduce a number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently CRIU traces syscalls to catch a moment, when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().
We can reduce a number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.
v2: In the first version a breakpoint is set after sigreturn. In this
case we have a problem with signals. If a process has pending signals,
it will start to precess them after exiting from sigreturn(), but before
returning to userspace. So the breakpoint will not be triggered.
And at the end Here are a few numbers how we catch sigreturn.
Before this patch criu executes 36 syscalls and gets 12 signals.
With this patch criu executes 18 syscalls and gets 5 signals.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The control socket has enough buffer for one command and the target
process will not wait a new command, so we will avoid extra context
switches.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Test maps 17 pages and mlocks them, then changes user id from root
to 18943, after c/r checks that MAP_LOCKED bit is set for that vma.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If a file like /proc/20/mountinfo is open, but 20 is a zombie (or doesn't exist
any more), we can't read this file at all, so a link remap won't work. Instead,
we add a new remap, called the dead process remap, which forks a TASK_HELPER as
that dead pid so that the restore task can open the new /proc/20/mountinfo
instead.
This commit also adds a new stage CR_STATE_RESTORE_SHARED. Since new
TASK_HELPERS are added when loading the shared resource images, we need to wait
to start forking tasks until after these resources are loaded.
v2: fix a mutex bug
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In order to use TASK_HELPERS to open files from dead processes, they should
persist until criu is done restoring the filesystem, which happens in the
RESTORE stage. To do this, we need to pass each helper's PIDs to the restorer
blob, so that it can wait() on them when the restore stage is done.
This commit is in preparation for the remap_dead_pid commits.
v2: wait() on helpers after restore stage is over
v3: add CR_STATE_RESTORE_FS stage
v4: CR_STATE_RESTORE_FS waits for nr_tasks + nr_helpers, not nr_threads
v5: ditch CR_STATE_RESTORE_FS in favor of passing helpers to restorer blob
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently here is a bug, because when we see criu's mount namespace,
we go to the "out" mark and don't validate mounts.
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It turned out, that fdopen (used in fopen_proc) always maps
a 4k buffer for reads and this buffer gets unmap-ed later
on fclose.
Taking into account the amount of proc files we read (~20
per task plus one file per opened file descriptor) this
mmap+munmap result in quite a lot of useless CPU time.
E.g. for a container of 20 tasks we have 1000 calls taking
~8% of total dump time.
So lets first stop doing this for simple cases -- one line
proc files.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
So it won't depend on the order in declaration.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They don't change these objects, so can share them
with parent (will be created slightly faster :) ).
The plan is to make them CLONE_VM, but it's not that
easy.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When clone-ing kids we can set their stack on current, as
it will anyway be COW-ed later. One thing to note -- we do
need to reserve some space on the stack for glibc's arguments
and retcode allocation. 128 bytes should be enough for 16
pointers while clone has 5 arguments.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Maintain backwards compatibility for old images, but don't set the REMAP_GHOST
bit going forward, only use the remap_type field.
v2: * preserve remap_id in GHOST_REMAP case
* protobuf field is remap_type enum not u32
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In cr_dump_tasks() we expect restore_root_task to return < 0 if
error ocures.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If there is no separator in first place we should
avoid implicit + 1 which make @name = 1 in worst case.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
My debian testing produces the following output for uname:
$ uname -r
3.14-2-amd64
and so:
$ set -- `uname -r | sed 's/\./ /g'`
$ echo $1
3
$ echo $2
14-2-amd64
this causes zdtm.sh to fail for me on line 293:
[ $1 -eq 3 -a $2 -ge 11 ] && return 0
because "14-2-amd64 -ge 11" is false.
Signed-off-by: Matthias Neuer <matthias.neuer@uni-ulm.de>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This makes only root to be able to modify images by default.
When using criu with suid bit set, group of the images is set
to user group, which is not safe, considering current CR_FD_PERM.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If @ticks is zero the kernel returns error
because on creation the @ticks is already zero,
so simply setup @ticks if real value present.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The PTRACE_SYSCALL traps task twice -- first on enter into
and then on exit from syscall. If we trace a single task (and
we do it on dump two times per task) we may skip half of all
getregs calls -- on exit we don't need them.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We have the same check a few lines above.
v2: fix the subject
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
$ cat /proc/self/mountinfo
...
1 1 0:2 / / rw - rootfs rootfs rw,size=373396k,nr_inodes=93349
...
You can see that mnt_id and parent_mnt_id are equals here.
This patch interpretes this case as a root mount in a tree.
0'th mount is rootfs, which is mounted in init_mount_tree().
We don't see it in cases when system makes chroot, because of
static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
...
/* mountpoints outside of chroot jail will give SEQ_SKIP on this */
err = seq_path_root(m, &mnt_path, &root, " \t\n\\");
Cc: beproject criu <beprojectcriu@gmail.com>
Cc: Christopher Covington <cov@codeaurora.org>
Reported-by: beproject criu <beprojectcriu@gmail.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we compare sets in cg_set_compare() we presume that controller
names are properly sorted but because of use of strcmp(cc->path, path)
it's not true. In particular in case if there are two same sets which
differ in paths only
(00.126812) cg: `- New css ID 2
(00.127051) cg: `- [memory] -> [/vz-1]
(00.127079) cg: `- [name=systemd] -> [/vz-1]
(00.127108) cg: `- [net_cls] -> [/vz-1]
(00.239829) cg: `- New css ID 3
(00.240067) cg: `- [memory] -> [/vz-1]
(00.240096) cg: `- [net_cls] -> [/vz-1]
(00.240154) cg: `- [name=systemd] -> [/vz-1/system.slice/dbus.service]
we currently refuse to dump such configuretion. Thus remove
path comparision from the first place.
CC: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
mntinfo contains mounts from all namespaces, so we can validate it only
once after collecting mounts.
v2: add a fake comment about goto
v3: add a real comment about goto
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
I'm not quite sure what the difference is (I have gcc 4.8, but there are
probably also header differences), but when I compile the service on 14.04 I
get:
CC cr-service.o
cr-service.c: In function ‘start_page_server_req’:
cr-service.c:536:8: error: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Werror=unused-result]
write(start_pipe[1], &ret, sizeof(ret));
^
cr-service.c:544:6: error: ignoring return value of ‘read’, declared with attribute warn_unused_result [-Werror=unused-result]
read(start_pipe[0], &ret, sizeof(ret));
^
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Tested-by: https://travis-ci.org/avagin/criu/builds/34990769
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
tmpfs has the "size" option, which is not standard.
Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we stript options only one of brothers, but
mount_equal() thinks that two brothers should have the same options.
Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Reported-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"continue" is called by mistake, so we skip a few checks for shared
mounts without siblings.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a slight mess with how criu restores root task.
Right now we have the following options.
1) CLI
a) Usually
task calling criu
`- criu
`- root restored task
b) when --restore-detached AND root has pdeath_sig
task calling criu
`- criu
`- root restored task
2) Library/SWRK
task using lib/swrk
`- criu
`- root restored task
3) Standalone service
a) Usually
service
`- service sub task
`- root restored task
b) when root has pdeath_sig
criu service
`- criu sub task
`- root restored task
It would be better is CRIU always restored the root task as sibling,
but we have 3 constraints:
First, the case 1.a is kept for zdtm to run tests in pid namespaces
on 3.11, which in turn doesn't allow CLONE_PARENT | CLONE_NEWPID.
Second, CLI w/o --restore-detach waits for the restored task to die and
this behavior can be "expected" already.
Third, in case of standalone service tasks shouldn't become service's
children.
And I have one "plan". The p.haul project while live migrating tasks
on destination node starts a service, which uses library/swrk mode. In
this case the restored processes become p.haul service's kids which is
also not great.
That said, here's the option called --restore-child that pairs the
--restore-detach like this:
* detached AND child:
task
`- criu restore (exits at the end)
`- root task
The root task will become task's child.
This will be default to library/swrk.
This is what LXC needs.
* detach AND !child
task
`- criu restore (exits at the end)
`- root task
The root task will get re-parented to init.
This will be compatible with 1.3.
This will be default to standalone service and
to my wish with the p.haul case.
* !detach AND child
task
`- criu restore (waits for root task to die)
`- root task
This should be deprecated, so that criu restore doesn't mess
with task <-> root task signalling.
* !detach AND !child
task
`- criu restore (waits for root task to die)
`- root task
This is how plain criu restore works now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
root_as_sibling was used in criu_signals_setup(), but was only defined later
(when forking the root task for the first time). This meant that the
SA_NOCLDSTOP was never masked off, which meant SIGCHLD was never delivered
after ptracing the root task. Thus, when the a child of the root task died
(e.g. from cr_system), the root task sat in PTRACE_STOP, and the restore task
never PTRACE_CONT'd, resulting in a deadlock.
Instead, we only unmask SA_NOCLDSTOP right before we PTRACE_SEIZE, after the
value is defined.
v2: re-work the condition for CLONE_PARENT
v3: move unmasking of SA_NOCLDSTOP to restore_root_task
v4: keep all the comments in the original code
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since the command line for checkpointing and restoring Docker containers
is very long and there are some manual steps involved before restoring
a container, it's much easier to use a shell script to automate the work.
One would simply do:
$ sudo docker_cr.sh -c
$ sudo docker_cr.sh -r
Signed-off-by: Saied Kazemi <saied@google.com>
Acked-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
BTRFS returns subvolume dev-id instead of superblock dev-id,
in such case return device obtained from mountinfo (ie subvolume0).
v2: fix up devices only for btrfs files.
v3: use phys_stat_dev_match instead of phys_stat_resolve_dev
v4: fix cosmetic whims
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It turns out that we can't be too strict about
queued events -- criu itself generates a number
of them and there is no clear way yet how to resolve
this situation. So defer "strict" mode for now
but print a warning.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When service starts page server all the preparations (log, wdir, img dir, etc.)
happen in parent task, then we fork page server.
This is OK for now, but when we will serve several requests per connection, all
these resources would be leaked in parent.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The problem with several requests is that criu leaks resources after
doing dump/restore. It's OK since process exits anyway, but for
multy requests per connection it's better to audit this thing.
For now -- allow to do requests after the page-server-start one only.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It is called from prepare_cgroup_sfd() and cr_restore_tasks().
+ criu restore --file-locks --tcp-established --evasive-devices --link-remap --root /var/lib/vz/root/101 --restore-detached --action-script /usr/local/libexec/vzctl/scripts/vps-rst-env -D /vz/dump/Dump.101 -o restore.log -vvvv --pidfile /var/lib/vzctl/vepid/101
*** Error in `criu': double free or corruption (fasttop): 0x00000000006bcd40 ***
Program terminated with signal 6, Aborted.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-20.fc19.x86_64 libgcc-4.8.3-1.fc19.x86_64 protobuf-c-0.15-7.fc19.x86_64
(gdb) bt
#0 0x00007ffff72179e9 in raise () from /lib64/libc.so.6
#1 0x00007ffff72190f8 in abort () from /lib64/libc.so.6
#2 0x00007ffff7257d17 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff725f0b8 in _int_free () from /lib64/libc.so.6
#4 0x0000000000426971 in cr_restore_tasks () at cr-restore.c:1833
#5 0x0000000000418426 in main (argc=<optimized out>, argv=0x7fffffffeb38, envp=<optimized out>) at crtools.c:479
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>