validate_mounts() prints ->mnt_id in hex when it reports the failure.
This complicates the understanding because this ->mnt_id is printed as
decimal elsewhere, including /proc/$pid/mountinfo.
parse_mountinfo() adds "0x" at least and this is just pr_info(), but
lets change it too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
parse_smaps() is too big for easy reading. In addition, we are
creating a new interface to get information about processes, which is
called taskdiag, so parse_smaps() will do only what it should do
accoding with the name. All other should be moved in a separate
functions which will be reused to work with task_diag.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
parse_smaps() is too big for easy reading. In addition, we are
creating a new interface to get information about processes, which is
called taskdiag, so parse_smaps() will do only what it should do
accoding with the name. All other should be moved in a separate
functions which will be reused to work with task_diag.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Starting with version 3.15, the kernel provides a mnt_id field in
/proc/<pid>/fdinfo/<fd>. However, the value provided by the kernel for
AUFS file descriptors obtained by opening a file in /proc/<pid>/map_files
is incorrect.
Below is an example for a Docker container running Nginx. The mntid
program below mimics CRIU by opening a file in /proc/1/map_files and
using the descriptor to obtain its mnt_id. As shown below, mnt_id is
set to 22 by the kernel but it does not exist in the mount namespace of
the container. Therefore, CRIU fails with the error:
"Unable to look up the 22 mount"
In the global namespace, 22 is the root of AUFS (/var/lib/docker/aufs).
This patch sets the mnt_id of these AUFS descriptors to -1, mimicing
pre-3.15 kernel behavior.
$ docker ps
CONTAINER ID IMAGE ...
3850a63ee857 nginx-streaming:latest ...
$ docker exec -it 38 bash -i
root@3850a63ee857:/# ps -e
PID TTY TIME CMD
1 ? 00:00:00 nginx
7 ? 00:00:00 nginx
31 ? 00:00:00 bash
46 ? 00:00:00 ps
root@3850a63ee857:/# ./mntid 1
open("/proc/1/map_files/400000-4b8000") = 3
cat /proc/49/fdinfo/3
pos: 0
flags: 0100000
mnt_id: 22
root@3850a63ee857:/# awk '{print $1 " " $2}' /proc/1/mountinfo
87 58
103 87
104 87
105 104
106 104
107 104
108 87
109 87
110 87
111 87
root@3850a63ee857:/# exit
$ grep 22 /proc/self/mountinfo
22 21 8:1 /var/lib/docker/aufs /var/lib/docker/aufs ...
44 22 0:35 / /var/lib/docker/aufs/mnt/<ID> ...
$
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch reworks fixup_aufs_vma_fd() to let symbolic links in
/proc/<pid>/map_files that are not pointing to AUFS branch names follow
the non-AUFS applcation logic.
The use case that prompted this commit was an application mapping
/dev/zero as shared and writeable which shows up in map_files as:
lrw------- ... 7fc5c5a5f000-7fc5c5a60000 -> /dev/zero (deleted)
If the AUFS support code reads the link, it will have to strip off the
" (deleted)" string added by the kernel but core CRIU code already
does this.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When AIO context is set up kernel does two things:
1. creates an in-kernel aioctx object
2. maps a ring into process memory
The 2nd thing gives us all the needed information
about how the AIO was set up. So, in order to dump
one we need to pick the ring in memory and get all
the information we need from it.
One thing to note -- we cannot dump tasks if there
are any AIO requests pending. So we also need to
go to parasite and check the ring to be empty.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We will need to detect aio mappings soon, so this is a preparation,
that makes future patching simpler.
Also move aufs stat-ing into aufs code to keep more aufs logic in
one place.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: don't leak FILE
CID 73423 (#1 of 1): Resource leak (RESOURCE_LEAK)
15. leaked_storage: Variable f going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All out processes are stopped in a moment, when file locks are
collected, so they can't to wait any locks.
Here is a proof of this theory:
[root@avagin-fc19-cr ~]# flock xxx sleep 1000 &
[1] 23278
[root@avagin-fc19-cr ~]# flock xxx sleep 1000 &
[2] 23280
[root@avagin-fc19-cr ~]# cat /proc/locks
1: FLOCK ADVISORY WRITE 23278 08:03:280001 0 EOF
1: -> FLOCK ADVISORY WRITE 23280 08:03:280001 0 EOF
[root@avagin-fc19-cr ~]# gdb -p 23280
(gdb) ^Z
[3]+ Stopped gdb -p 23280
[root@avagin-fc19-cr ~]# cat /proc/locks
1: FLOCK ADVISORY WRITE 23278 08:03:280001 0 EOF
Currently criu can dump nothing, if we have one process which is
waiting a lock. I don't see any reason to do this.
v2: typo fix
Cc: Qiang Huang <h.huangqiang@huawei.com>
Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CID 73370: Resource leak (RESOURCE_LEAK)
13. leaked_storage: Variable timer going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
strstr is a really heavy one, lets use already defined
and filled @file_path variable instead.
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is convenient when need to lookup into debug prints
and check which mount point were used somewhere else
(in particular I will need @mnt_id in tty code so
on error I can easily figure out which mountpoint has
been used).
No func changes.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
No need to invent new error codes here, simply
use ERR_PTR/IS_ERR_OR_NULL and such.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While been converting reading of data stream
to bfd the @buf member was left untouched leading
to incorrect data to be read, fix it setting up
proper one, ie @str itself, otherwise dumping
of timerfd files are failing.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a, well, issue with how we calculate the vma's mnt_id.
Right now get one via criu side file descriptor that it got by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.
So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.
I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openeings.
Note, that in case app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).
Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If there is no separator in first place we should
avoid implicit + 1 which make @name = 1 in worst case.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we compare sets in cg_set_compare() we presume that controller
names are properly sorted but because of use of strcmp(cc->path, path)
it's not true. In particular in case if there are two same sets which
differ in paths only
(00.126812) cg: `- New css ID 2
(00.127051) cg: `- [memory] -> [/vz-1]
(00.127079) cg: `- [name=systemd] -> [/vz-1]
(00.127108) cg: `- [net_cls] -> [/vz-1]
(00.239829) cg: `- New css ID 3
(00.240067) cg: `- [memory] -> [/vz-1]
(00.240096) cg: `- [net_cls] -> [/vz-1]
(00.240154) cg: `- [name=systemd] -> [/vz-1/system.slice/dbus.service]
we currently refuse to dump such configuretion. Thus remove
path comparision from the first place.
CC: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we stript options only one of brothers, but
mount_equal() thinks that two brothers should have the same options.
Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Reported-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These guys may have pids that are not met in pstree.
This is not the reason for skipping those, try to
resolve flocks anyway.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we keep the lock type (posix/flock) till the
time we dump it, then "decode" it into binary value.
I will need the easy-to-check one early, so parse the
kind in proc_parse.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the eventpoll image. The eventpoll tfd image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the fanotify image. The fanotify mark image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the inotify image. The inotify wd image will be depricated.
v2: cb() must always free an entry
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root. This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likekely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If the root task is forked in a new pidns, it can't use its pid for
accessing /proc, because this proc belongs to the source pidns.
v2: don't copy a static string.
v3: take a bright part of Tycho's patch
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
proc_parse.c: In function ‘parse_task_cgroup’:
proc_parse.c:1603:16: error: ‘path’ may be used uninitialized in this function [-Werror=uninitialized]
cc1: all warnings being treated as errors
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CID 1168165 (#2 of 2): Untrusted array index read (TAINTED_SCALAR)
40. tainted_data: Using tainted variable "hoff" as an index into an
array "str"
$ man 3 scanf
n Nothing is expected; instead, the number of characters consumed
thus far from the input is stored through the next pointer,
which must be a pointer to int. This is not a conversion,
although it can be suppressed with the * assignment-suppression
character. The C standard says: "Execution of a %n directive
does not increment the assignment count returned at the comple‐
tion of execution" but the Corrigendum seems to contradict this.
Probably it is wise not to make any assumptions on the effect of
%n conversions on the return value.
So it isn't not enough to check a return code from scanf().
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Coverity: 1230177 Dereference before null check
There may be a null pointer dereference, or else the comparison against
null is unnecessary. In parse_task_cgroup: All paths that lead to this
null pointer comparison already dereference the pointer earlier
(CWE-476)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case if something is broken in the kernel and
we get a format corrupted -- simply exit out with
error instead of strlen'ing nil string.
Also while at it -- add a comment about format.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>