Unfortunately, SECCOMP_MODE_FILTER is not currently exposed to userspace,
so we can't checkpoint that. In any case, this is what we need to do for
SECCOMP_MODE_STRICT, so let's do it.
This patch works by first disabling seccomp for any processes that are going
to have seccomp filters restored, then restoring the process (including the
seccomp filters), and finally resuming the seccomp filters before detaching
from the process.
v2 changes:
* update for kernel patch v2
* use protobuf enum for seccomp type
* don't parse /proc/pid/status twice
v3 changes:
* get rid of extra CR_STAGE_SECCOMP_SUSPEND stage
* only suspend seccomp in finalize_restore(), just before the unmap
* restore the (same) seccomp state in threads too; also add a note about
how this is slightly wrong, and that we should at least check for a
mismatch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
00:03:27.746 (00.008815) Error (bfd.c:149): bfd: Error reading file: No such process
Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CRIU always restores the mounts as MNT_RELATIME. This is because the
kernel uses this mode by default, so we need to pass MS_STRICTATIME
explicitly if we didn't see "noatime" or "MS_RELATIME".
While at it, make mnt_opt2flag[] and sb_opt2flag "static", otherwise
gcc actually creates these arrays on the stack even if they are "const".
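A minimal sketch of the flag selection described above, not the actual
CRIU code. The names opt_present() and atime_flags_from_opts() and the
comma-separated option string are assumptions made for illustration;
only the MS_* constants come from <sys/mount.h>.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper: is "name" one of the comma-separated options? */
static int opt_present(const char *opts, const char *name)
{
	size_t len = strlen(name);
	const char *p = opts;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t tok_len = end ? (size_t)(end - p) : strlen(p);

		if (tok_len == len && !strncmp(p, name, len))
			return 1;
		if (!end)
			break;
		p = end + 1;
	}
	return 0;
}

/* Hypothetical helper: pick the atime-related mount flags to restore. */
static unsigned long atime_flags_from_opts(const char *opts)
{
	/* "static const" keeps the table out of the stack frame. */
	static const struct {
		const char *opt;
		unsigned long flag;
	} atime_opts[] = {
		{ "noatime",  MS_NOATIME  },
		{ "relatime", MS_RELATIME },
	};
	unsigned long flags = 0;
	size_t i;

	for (i = 0; i < sizeof(atime_opts) / sizeof(atime_opts[0]); i++)
		if (opt_present(opts, atime_opts[i].opt))
			flags |= atime_opts[i].flag;

	/* Neither option seen: ask for strict atime explicitly. */
	if (!(flags & (MS_NOATIME | MS_RELATIME)))
		flags |= MS_STRICTATIME;

	return flags;
}
```

With this sketch, an option string such as "rw,nosuid" yields
MS_STRICTATIME, while "rw,relatime" yields MS_RELATIME.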
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
/proc/locks can contain a wrong pid for a lock and we always need to
check this fact. Starting with the 4.1 kernel, locks are reported
in fdinfo.
v2: rebase to the current master
skip note_file_lock()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We should not have a chance to exit with a wrong code on error
paths.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We can simply overwrite the dot symbol right after the kernel reports
it to us.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Which obviously can be used to "ignore" the mounts we do not want or
need to dump. The user should know what they are doing.
Note: this patch changes parse_mountinfo() to check should_skip_mount().
This is because imo we want to filter out the unwanted mounts asap, as
if they do not exist. This increases the chances that dumping will fail
if something else depends on this mount, say, another mountpoint or an
open file.
Perhaps it makes sense to teach should_skip_mount() to use fnmatch()
and/or look at the optional "(fs|mnt)=" prefix to skip by fsname too.
To me it would be better to force the user of this option to understand
what it does. Say, if "dump" fails because the child mount can't find
the skipped parent, the user should add another --skip-mnt option or not
dump at all. Otherwise, if we do this automagically the user can probably
be surprised, and might even miss the fact that we skip more than was asked.
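The fnmatch() idea mentioned above could look roughly like the sketch
below. This is not what the patch implements: skip_by_pattern() and its
arguments are hypothetical, and FNM_PATHNAME is just one possible choice
(it stops "*" from matching across "/").

```c
#include <fnmatch.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the mountpoint matches any of the
 * user-supplied shell patterns, 0 otherwise.
 */
static int skip_by_pattern(const char *mountpoint,
			   const char *const *patterns, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (fnmatch(patterns[i], mountpoint, FNM_PATHNAME) == 0)
			return 1;

	return 0;
}
```

Because of FNM_PATHNAME, a pattern like "/mnt/*" would match "/mnt/data"
but not "/mnt/data/sub", so the user would still have to name nested
mounts explicitly, in line with the reasoning above.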
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preparation.
1. Add the new "bool for_dump" arg to collect/parse_mntinfo().
2. Introduce "struct collect_mntns_arg" to pass the additional
"bool for_dump" field to collect_mntinfo() and change it to
pass this boolean to collect_mntinfo()->parse_mountinfo() path.
3. Change other callers of collect_mntinfo() to pass "false".
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have two helpers for VMA type testing: privately_dump_vma() and vma_priv(). They
work with different types but basically do the same thing: check whether we should
dump the VMA into the image and restore it afterwards.
Let's unify them both into a common vma_entry_is_private() helper, plus
vma_area_is_private() for working with the vma_area type.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is purely theoretical, especially in this particular case when we
actually want to (likely) free the unused memory. Still, code which
ignores a potential error doesn't look good.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
1. parse_mountinfo_ent() mixes "return -1" and "goto err" on failure,
this looks confusing and inconsistent.
2. And buggy. It forgets to free(opt) if parse_mnt_flags() fails.
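A minimal sketch of the single-exit cleanup style that addresses both
points above. It is illustrative only: struct mnt_entry, dup_str(),
parse_flags() and parse_entry() are hypothetical stand-ins, not the real
parse_mountinfo_ent() code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified mount entry. */
struct mnt_entry {
	char *source;
	unsigned int flags;
};

/* Portable stand-in for strdup(). */
static char *dup_str(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Hypothetical flag parser: fails on an empty option string. */
static int parse_flags(const char *opt, unsigned int *flags)
{
	if (!*opt)
		return -1;
	*flags = !strcmp(opt, "ro");
	return 0;
}

static int parse_entry(const char *source, const char *opt_str,
		       struct mnt_entry *e)
{
	char *opt;
	int ret = -1;

	e->source = NULL;
	e->flags = 0;

	opt = dup_str(opt_str);
	if (!opt)
		goto err;
	if (parse_flags(opt, &e->flags))
		goto err;	/* opt is still freed below */
	e->source = dup_str(source);
	if (!e->source)
		goto err;

	ret = 0;
err:
	/* One exit path: the temporary is freed on success and failure. */
	free(opt);
	return ret;
}
```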
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The caller will do this on failure too. So this is unnecessary and wrong
because we do not nullify ->mountpoint.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
1. parse_mountinfo() forgets to free(fst) if parse_mountinfo_ent()
succeeds.
2. The usage of fst/r_fstype is overcomplicated for no reason.
Just change the parse_mountinfo() paths to populate/use/free this
fsname unconditionally, and move the ownership to the caller. There
is no reason to check FSTYPE__UNSUPPORTED and/or fallback to ->name.
Better yet, we could even turn fsname into the local "char []" and
avoid %ms and free(), but then we would need to pass the length of
this buffer to parse_mountinfo_ent().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preparation to simplify the review. parse_mountinfo() assumes that:
1. The "err:" block does all the necessary cleanups on failure.
This is wrong, see the next patch.
2. We can never skip the mountpoint.
This is true, but we are going to change this.
s/goto err/goto end/ in the main loop, add the "end:" label which inserts
the new mount_info into the list and then checks ret != 0 to figure out
whether we need to abort.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use the PRIx64 format specifier instead of %lx to print a uint64_t
integer.
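For illustration, a small sketch of the portable form. The helper name
format_u64_hex() is made up for this example. With %lx the argument is
expected to be an unsigned long, which is only 32 bits wide on 32-bit
builds, so the compiler warns there.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a 64-bit value as hex on any word size. */
static int format_u64_hex(char *buf, size_t size, uint64_t val)
{
	/* PRIx64 expands to the right length modifier for uint64_t. */
	return snprintf(buf, size, "%" PRIx64, val);
}
```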
Reported-by: Mr Travis CI
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
validate_mounts() prints ->mnt_id in hex when it reports the failure.
This complicates the understanding because this ->mnt_id is printed as
decimal elsewhere, including /proc/$pid/mountinfo.
parse_mountinfo() adds "0x" at least and this is just pr_info(), but
let's change it too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
parse_smaps() is too big for easy reading. In addition, we are
creating a new interface to get information about processes, which is
called taskdiag, so parse_smaps() will do only what it should do
according to its name. Everything else should be moved into separate
functions which will be reused to work with task_diag.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Starting with version 3.15, the kernel provides a mnt_id field in
/proc/<pid>/fdinfo/<fd>. However, the value provided by the kernel for
AUFS file descriptors obtained by opening a file in /proc/<pid>/map_files
is incorrect.
Below is an example for a Docker container running Nginx. The mntid
program below mimics CRIU by opening a file in /proc/1/map_files and
using the descriptor to obtain its mnt_id. As shown below, mnt_id is
set to 22 by the kernel but it does not exist in the mount namespace of
the container. Therefore, CRIU fails with the error:
"Unable to look up the 22 mount"
In the global namespace, 22 is the root of AUFS (/var/lib/docker/aufs).
This patch sets the mnt_id of these AUFS descriptors to -1, mimicking
pre-3.15 kernel behavior.
$ docker ps
CONTAINER ID IMAGE ...
3850a63ee857 nginx-streaming:latest ...
$ docker exec -it 38 bash -i
root@3850a63ee857:/# ps -e
PID TTY TIME CMD
1 ? 00:00:00 nginx
7 ? 00:00:00 nginx
31 ? 00:00:00 bash
46 ? 00:00:00 ps
root@3850a63ee857:/# ./mntid 1
open("/proc/1/map_files/400000-4b8000") = 3
cat /proc/49/fdinfo/3
pos: 0
flags: 0100000
mnt_id: 22
root@3850a63ee857:/# awk '{print $1 " " $2}' /proc/1/mountinfo
87 58
103 87
104 87
105 104
106 104
107 104
108 87
109 87
110 87
111 87
root@3850a63ee857:/# exit
$ grep 22 /proc/self/mountinfo
22 21 8:1 /var/lib/docker/aufs /var/lib/docker/aufs ...
44 22 0:35 / /var/lib/docker/aufs/mnt/<ID> ...
$
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch reworks fixup_aufs_vma_fd() to let symbolic links in
/proc/<pid>/map_files that are not pointing to AUFS branch names follow
the non-AUFS application logic.
The use case that prompted this commit was an application mapping
/dev/zero as shared and writeable which shows up in map_files as:
lrw------- ... 7fc5c5a5f000-7fc5c5a60000 -> /dev/zero (deleted)
If the AUFS support code reads the link, it will have to strip off the
" (deleted)" string added by the kernel but core CRIU code already
does this.
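As a rough illustration of the stripping step mentioned above (this is a
sketch, not the core CRIU routine; strip_deleted_suffix() is a
hypothetical name):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: remove a trailing " (deleted)" from a link
 * target in place. Returns 1 if the suffix was stripped, 0 otherwise.
 */
static int strip_deleted_suffix(char *path)
{
	static const char suffix[] = " (deleted)";
	size_t slen = sizeof(suffix) - 1;
	size_t len = strlen(path);

	if (len < slen || strcmp(path + len - slen, suffix))
		return 0;

	path[len - slen] = '\0';
	return 1;
}
```

Applied to "/dev/zero (deleted)" this leaves "/dev/zero" in the buffer.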
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When an AIO context is set up, the kernel does two things:
1. creates an in-kernel aioctx object
2. maps a ring into process memory
The second step gives us all the information we need
about how the AIO context was set up. So, in order to dump
one, we need to find the ring in memory and read that
information from it.
One thing to note: we cannot dump tasks if there
are any AIO requests pending. So we also need to
go into the parasite and check that the ring is empty.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We will need to detect aio mappings soon, so this is a preparation,
that makes future patching simpler.
Also move aufs stat-ing into aufs code to keep more aufs logic in
one place.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: don't leak FILE
CID 73423 (#1 of 1): Resource leak (RESOURCE_LEAK)
15. leaked_storage: Variable f going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All our processes are stopped at the moment file locks are
collected, so they can't be waiting for any locks.
Here is a proof of this theory:
[root@avagin-fc19-cr ~]# flock xxx sleep 1000 &
[1] 23278
[root@avagin-fc19-cr ~]# flock xxx sleep 1000 &
[2] 23280
[root@avagin-fc19-cr ~]# cat /proc/locks
1: FLOCK ADVISORY WRITE 23278 08:03:280001 0 EOF
1: -> FLOCK ADVISORY WRITE 23280 08:03:280001 0 EOF
[root@avagin-fc19-cr ~]# gdb -p 23280
(gdb) ^Z
[3]+ Stopped gdb -p 23280
[root@avagin-fc19-cr ~]# cat /proc/locks
1: FLOCK ADVISORY WRITE 23278 08:03:280001 0 EOF
Currently criu refuses to dump anything if even one process is
waiting for a lock. I don't see any reason for this.
v2: typo fix
Cc: Qiang Huang <h.huangqiang@huawei.com>
Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CID 73370: Resource leak (RESOURCE_LEAK)
13. leaked_storage: Variable timer going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
strstr is a really heavy one, let's use the already defined
and filled @file_path variable instead.
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is convenient when we need to look into debug prints
and check which mount point was used somewhere else
(in particular I will need @mnt_id in the tty code so that
on error I can easily figure out which mountpoint has
been used).
No func changes.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
No need to invent new error codes here, simply
use ERR_PTR/IS_ERR_OR_NULL and such.
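For readers unfamiliar with the idiom, here is a userspace sketch modeled
on the Linux kernel helpers of the same names. The definitions below are
written for this example and may differ from the ones in the tree;
lookup() and its table are purely hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value back out of an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical caller: error, "not found" and success in one pointer. */
static int table[4] = { 10, 20, 30, 40 };

static int *lookup(int idx)
{
	if (idx < 0)
		return ERR_PTR(-EINVAL);	/* hard error */
	if (idx >= 4)
		return NULL;			/* nothing found */
	return &table[idx];
}
```

The caller can then test the result with IS_ERR_OR_NULL() and, for a
hard error, recover the code with PTR_ERR() instead of a separate
out-parameter or ad-hoc error constants.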
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While converting the reading of the data stream
to bfd, the @buf member was left untouched, leading
to incorrect data being read. Fix it by setting up
the proper one, i.e. @str itself; otherwise dumping
of timerfd files fails.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a, well, issue with how we calculate the vma's mnt_id.
Right now we get one via the criu-side file descriptor obtained by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.
So move this mnt_id getting earlier into the vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce a wrong result.
"Wrong" from the perspective that on restore the correct file would
be opened from the wrong namespace.
I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, and it saves a lot
of openings.
Note that in case the app switched mount namespace and then mapped
some new library (with dlopen), things would work correctly -- the new
vmas will likely not be adjacent and will be for a different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).
Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>