strstr is a really heavy one, lets use already defined
and filled @file_path variable instead.
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is convenient when need to lookup into debug prints
and check which mount point were used somewhere else
(in particular I will need @mnt_id in tty code so
on error I can easily figure out which mountpoint has
been used).
No func changes.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
No need to invent new error codes here, simply
use ERR_PTR/IS_ERR_OR_NULL and such.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While been converting reading of data stream
to bfd the @buf member was left untouched leading
to incorrect data to be read, fix it setting up
proper one, ie @str itself, otherwise dumping
of timerfd files are failing.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a, well, issue with how we calculate the vma's mnt_id.
Right now get one via criu side file descriptor that it got by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.
So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.
I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openeings.
Note, that in case app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).
Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If there is no separator in first place we should
avoid implicit + 1 which make @name = 1 in worst case.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we compare sets in cg_set_compare() we presume that controller
names are properly sorted but because of use of strcmp(cc->path, path)
it's not true. In particular in case if there are two same sets which
differ in paths only
(00.126812) cg: `- New css ID 2
(00.127051) cg: `- [memory] -> [/vz-1]
(00.127079) cg: `- [name=systemd] -> [/vz-1]
(00.127108) cg: `- [net_cls] -> [/vz-1]
(00.239829) cg: `- New css ID 3
(00.240067) cg: `- [memory] -> [/vz-1]
(00.240096) cg: `- [net_cls] -> [/vz-1]
(00.240154) cg: `- [name=systemd] -> [/vz-1/system.slice/dbus.service]
we currently refuse to dump such configuretion. Thus remove
path comparision from the first place.
CC: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we stript options only one of brothers, but
mount_equal() thinks that two brothers should have the same options.
Execute zdtm/live/static/mountpoints
./mountpoints --pidfile=mountpoints.pid --outfile=mountpoints.out
Dump 2737
WARNING: mountpoints returned 1 and left running for debug needs
Test: zdtm/live/static/mountpoints, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/mountpoints, Namespace:
Dump log : /root/git/criu/test/dump/static/mountpoints/2737/1/dump.log
--------------------------------- grep Error ---------------------------------
(00.146444) Error (mount.c:399): Two shared mounts 50, 67 have different sets of children
(00.146460) Error (mount.c:402): 67:./zdtm_mpts/dev/share-1 doesn't have a proper point for 54:./zdtm_mpts/dev/share-3/test.mnt.share
(00.146820) Error (cr-dump.c:1921): Dumping FAILED.
------------------------------------- END -------------------------------------
================================= ERROR OVER =================================
Reported-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These guys may have pids that are not met in pstree.
This is not the reason for skipping those, try to
resolve flocks anyway.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we keep the lock type (posix/flock) till the
time we dump it, then "decode" it into binary value.
I will need the easy-to-check one early, so parse the
kind in proc_parse.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the eventpoll image. The eventpoll tfd image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the fanotify image. The fanotify mark image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the inotify image. The inotify wd image will be depricated.
v2: cb() must always free an entry
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping Docker containers using the AUFS graph driver, we can
use the --root option instead of --aufs-root for specifying the
container's root. This patch obviates the need for --aufs-root
and makes dump CLI more consistent with restore CLI.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likekely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If the root task is forked in a new pidns, it can't use its pid for
accessing /proc, because this proc belongs to the source pidns.
v2: don't copy a static string.
v3: take a bright part of Tycho's patch
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
proc_parse.c: In function ‘parse_task_cgroup’:
proc_parse.c:1603:16: error: ‘path’ may be used uninitialized in this function [-Werror=uninitialized]
cc1: all warnings being treated as errors
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CID 1168165 (#2 of 2): Untrusted array index read (TAINTED_SCALAR)
40. tainted_data: Using tainted variable "hoff" as an index into an
array "str"
$ man 3 scanf
n Nothing is expected; instead, the number of characters consumed
thus far from the input is stored through the next pointer,
which must be a pointer to int. This is not a conversion,
although it can be suppressed with the * assignment-suppression
character. The C standard says: "Execution of a %n directive
does not increment the assignment count returned at the comple‐
tion of execution" but the Corrigendum seems to contradict this.
Probably it is wise not to make any assumptions on the effect of
%n conversions on the return value.
So it isn't not enough to check a return code from scanf().
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Coverity: 1230177 Dereference before null check
There may be a null pointer dereference, or else the comparison against
null is unnecessary. In parse_task_cgroup: All paths that lead to this
null pointer comparison already dereference the pointer earlier
(CWE-476)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case if something is broken in the kernel and
we get a format corrupted -- simply exit out with
error instead of strlen'ing nil string.
Also while at it -- add a comment about format.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
New kernel 3.16 will have old vDSO zone splitted into the two vmas:
one for vdso code itself and second that named vvar for data been
referenced from vdso code.
Because I can't do 'dump' and 'restore' parts of the code separately
(otherwise test would fail) the commit is pretty big one and hard to
read so here is detailed explanation what's going on.
1) When start dumping we detect vvar zone by reading /proc/pid/smap
and looking up for "[vvar]" token. Note the vvar zone is mapped
by a kernel with PF/IO flags so we should not fail here.
Also it's assumed that at least for now kernel won't be changed
much and [vvar] zone always follows the [vdso] zone, otherwise
criu will print error.
2) In previous commits we disabled dumping vvar area contents so
the restorer code never try to read vvar data but still we need
to map vvar zone thus vma entry remains in image.
3) As with previous vdso format we might have 2 cases
a) Dump and restore is happening on same kernel
b) Dump and restore are done on different kernels
To detect which case we have we parse vdso data from image
and find symbols offsets then compare their values with runtime
symbols provided us by a kernel. If they match and (!!!) the
size of vvar zone is the same -- we simply remap both zones
from runtime kernel into the positions dumpee had at checkpoint
time. This is that named "inplace" remap (a).
If this happens the vdso_proxify() routine drops VMA_AREA_REGULAR
from vvar area provided by a caller code and restorer won't try
to handle this vma. It looks somehow strange and probably should
be reworked but for now I left it as is to minimize the patch.
In case of (b) we need to generate a proxy. We do that in same
way as we were before just include vvar zone into proxy and save
vvar proxy address inside vdso mark injected into vdso area. Thus
on subsequent checkpoint we can detect proxy vvar zone and rip
it off the list of vmas to handle.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we build vDSO handling code for all archs provided
in the source code having some "common" parts inside pie/vdso.c,
pie/vdso-stub.c, vdso-stub.c and vdso.c. This were more or
less well but in new linux kernels (starting from 3.16 presumably)
the vDSO has been significantly reworked so every architecture
must have own vDSO handling engine (just like the kernel does).
So in this patch we move vDSO code to arch specific and because
aarch64 actually doesn't implement proxification yet due to
kernel restrictions -- we drops it out. When there will be
kernel support we bring it back in proper arch/aarch64
implementation.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
vmsplice doesn't work for such VMA-s.
This flags is set in a kernel function remap_pfn_range()
(remap kernel memory to userspace), which is widely used by device
drivers to provide direct access to a device memory.
Reported-by: J F <jgmb45@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"relative path" is absolute path with dot at the beginning.
We already use relative paths on restore. In this patch we add "."
on dump too. It's convinient, because we needed to add dot each time
when we want to access this mount point.
Before this patch we had to created a temporary copy.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used for restoring files from proper mounts.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to parse fdinfo for getting mnt_id,
so we can take there pos and flags and don't call
fcntl and lseek for that.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
An mmaped file is opened O_RDONLY or O_RDWR depending on the permissions
on the first vma dump_task_mm() encounters mapping that file. This
causes two problems:
1. If a file has multiple MAP_SHARED mappings, some of which are
read-only and some of which are read-write, and the first encountered
mapping happens to be read-only, the file will be opened O_RDONLY
during restore, and mmap(PROT_WRITE) will fail with EACCES, causing
the restore to fail.
2. If a file is opened read-write and mapped read-only, it will be
opened O_RDONLY during restore, so restore will succeed, but
mprotect(PROT_WRITE) on the read-only mapping after restore will
fail.
To fix both of these, record open flags per-vma based on the presence of
VM_MAYWRITE in smaps.
Signed-off-by: Jamie Liu <jamieliu@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
ID: 0
signal: 26/ (null)
notify: signal/pid.5954
ClockID: 1
fscanf "%p" doesn't handle "(null)".
https://bugzilla.openvz.org/show_bug.cgi?id=2894
v2: make the original scanf be %d/%s and then additionally
parse the obtained string
v3: don't use strstr
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We spend a lot of time reading the /proc/$pid/smaps file. The time
is spent in two places:
1 kernel puts too many info into it
2 fgets pulls info in 1024-bytes chunks, info about one vma is
typically bigger (up to 3k bytes) thus we call read() ~3 times
per one vma, which increases the amount of time spent in kernel
to re-fill this info
Setting the internal buffer to PAGE_SIZE size reduces the amount of
read()-s on ~60% during basic container dump. Setting bigger buffer
doesn't work, as kernel's seq file engine feeds at most one page of
data per read syscall regardless of the buffer size.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
On restore we only need to know currnet task mappings' start and end
to find where to put the restorer blob. And since the smaps file in
/proc/pid is up to 3 times slower, than the maps one, it makes
perfect sense just to parse the latter one.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we will read all VmaEntries in one big MmEntry object,
so to avoif copying them all into vma_areas, make them be pointable.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In /proc/<pid>/smaps/ output we may omit testing
for capital hex letters, since we know the format
kernel provides.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>