This is here only to support the Linux Kernel between versions
3.18 and 4.2. After that, this workaround is not needed anymore,
but it will work properly on both a kernel with and without the bug.
The bug is that when a process has a file open in an OverlayFS directory,
the information in /proc/<pid>/fd/<fd> and /proc/<pid>/fdinfo/<fd>
is wrong, so we grab that information from the mountinfo table instead.
This is done every time fill_fdlink is called.
We first check to see if the mnt_id and st_dev numbers currently match
some entry in the mountinfo table. If so, we already have the correct mnt_id
and no fixup is needed.
Then we proceed to see if there are any overlayFS mounted directories
in the mountinfo table. If so, we concatenate the mountpoint with the
name of the file, and stat the resulting path to check if we found the
correct device id and node number. If that is the case, we update the
mount id and link variables with the correct values.
Signed-off-by: Gabriel Guimaraes <gabriellimaguimaraes@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We use native cpuid, so this one is no longer used.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since we don't support dumping per-thread creds, let's at least fail to
dump if the creds don't match.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Unfortunately, SECCOMP_MODE_FILTER is not currently exposed to userspace,
so we can't checkpoint that. In any case, this is what we need to do for
SECCOMP_MODE_STRICT, so let's do it.
This patch works by first disabling seccomp for any processes who are going
to have seccomp filters restored, then restoring the process (including the
seccomp filters), and finally resuming the seccomp filters before detaching
from the process.
v2 changes:
* update for kernel patch v2
* use protobuf enum for seccomp type
* don't parse /proc/pid/status twice
v3 changes:
* get rid of extra CR_STAGE_SECCOMP_SUSPEND stage
* only suspend seccomp in finalize_restore(), just before the unmap
* restore the (same) seccomp state in threads too; also add a note about
how this is slightly wrong, and that we should at least check for a
mismatch
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
/proc/locks can contain a wrong pid for a lock and we always need to
check this fact. Starting with the 4.1 kernel, locks are reported
in fdinfo.
v2: rebase to the curret master
skip note_file_lock()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
With this flag, external shared bind mounts are attempted to be resolved
automatically.
v2: don't always assume when the sharing matches that the mount matches
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Which obviously can be used to "ignore" the mounts we do not want or
need to dump. The user should know what he does.
Note: this patch changes parse_mountinfo() to check should_skip_mount().
This is because imo we want to filter out the unwanted mounts asap, af
if they do not exist. This increases the chances the dumping will fail
if something else depends on this mount. Say, another mountpoint or an
opened file.
Perhaps it makes sense to teach should_skip_mount() to use fnmatch()
and/or look at the optional "(fs|mnt)=" prefix to skip by fsname too.
To me it would be better to force the user of this option to understand
what it does. Say, if "dump" fails because the child mount can't find
the skipped parent, he should add another --skip-mnt option or do not
dump. Otherwise, if we do this automagically the user can probably be
surpised, he might even miss the fact that we skip more than he asked.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preparation.
1. Add the new "bool for_dump" arg to collect/parse_mntinfo().
2. Introduce "struct collect_mntns_arg" to pass the additional
"bool for_dump" field to collect_mntinfo() and change it to
pass this boolean to collect_mntinfo()->parse_mountinfo() path.
3. Change other callers of collect_mntinfo() to pass "false".
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: don't leak FILE
CID 73423 (#1 of 1): Resource leak (RESOURCE_LEAK)
15. leaked_storage: Variable f going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a, well, issue with how we calculate the vma's mnt_id.
Right now get one via criu side file descriptor that it got by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.
So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.
I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openeings.
Note, that in case app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the eventpoll image. The eventpoll tfd image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the fanotify image. The fanotify mark image will be depricated.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to collect all objects in a list and write them into
the inotify image. The inotify wd image will be depricated.
v2: cb() must always free an entry
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The AUFS support code handles the "bad" information that we get from
the kernel in /proc/<pid>/map_files and /proc/<pid>/mountinfo files.
For details see comments in sysfs_parse.c.
The main motivation for this work was dumping and restoring Docker
containers which by default use the AUFS graph driver. For dump,
--aufs-root <container_root> should be added to the command line options.
For restore, there is no need for AUFS-specific command line options
but the container's AUFS filesystem should already be set up before
calling criu restore.
[ xemul: With AUFS files sometimes, in particular -- in case of a
mapping of an executable file (likekely the one created at elf load),
in the /proc/pid/map_files/xxx link target we see not the path
by which the file is seen in AUFS, but the path by which AUFS
accesses this file from one of its "branches". In order to fix
the path we get the info about branches from sysfs and when we
meet such a file, we cut the branch part of the path. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we add a temporary root to a mount point path. It's convinient
for restoring mount namespaces, but real paths are used for restoring
link-remap files.
v2: replace the offset field on a char * field
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A file system can be bind-mounted a few times and some of these mounts
can be non-root. We need to find one of root mounts and dump it.
v2: don't forget to check pm->dumped and pm->parent
don't dump a root file system, it's always external for now.
Reported-by: Saied Kazemi <saied@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On dump one uses one or more --ext-mount-map option with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.
On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace the should be bind mounted into the mountpoint.
v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we don't know mnt_id, we don't know to which namespace a file
belongs.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we'll restore nested mount namespaces, all but root ones (sub-namespaces)
will be restored as sub-mounts in the root mount namespace. So mi->mountpoint
will be not '/' even if a mount is root for its mntns.
v2: s/is_root/is_ns_root/
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"relative path" is absolute path with dot at the beginning.
We already use relative paths on restore. In this patch we add "."
on dump too. It's convinient, because we needed to add dot each time
when we want to access this mount point.
Before this patch we had to created a temporary copy.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used for restoring files from proper mounts.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to parse fdinfo for getting mnt_id,
so we can take there pos and flags and don't call
fcntl and lseek for that.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we only need to know currnet task mappings' start and end
to find where to put the restorer blob. And since the smaps file in
/proc/pid is up to 3 times slower, than the maps one, it makes
perfect sense just to parse the latter one.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The existing code opens "self" and parses what's in there,
just twist the code a little to accept generic pid.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All the entries with with_plugin set will be mounted by plugin.
The interesting case is when we do the pivot-root restore. In this
case we call restore callback very early (before we unmount the old
tree) and ask it to create the mountpoint at temporary location.
Later we move the mount to proper place.
The old_root argument of the callback is where it can find files
in the original mount namespace.
The is_file is return-argument. Sine files and directories cannot be
bind-mounted to each-other, the callback should create the mountpoint
itself and report whether it created file or directory.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
External bind mounts are those with source sitting outside of the
current FS view. Such are detected in validate_mounts(), so we
just go ahead and call plugins.
The plugin is provided with the mountpoint to decide whether it's
his or not (what else does the guy need?) and an ID with this it
can identify the mountpoint in /proc. The same ID will be used at
restore time to find the needed restore info.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a mess of uintX_t and uX usage. Drop off uintX_t ones.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We will need to parse btrfs stuff, but this one is not
in the supported list yet (as it's bound to hardware).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
It will hold auxiliary data associated with mount point. We
will use it for btrfs handling but in future it can be extended.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used in cr-restore.c for stopping threads on the exit from
sigreturn.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A few sentences, which are required for understanging this patch
2a) A shared mount can be replicated to as many mountpoints and all the
replicas continue to be exactly same.
2b) A slave mount is like a shared mount except that mount and umount
events only propagate towards it.
2c) A private mount does not forward or receive propagation.
All rules is there Documentation/filesystems/sharedsubtree.txt
If it's a first mount in a group, all group members should be
bind-mounted from this one.
Each mount propagates to all members of parent's group. The group can
contains a few slaves.
Mounts, which have propagated to slaves, are unmounted, because we can't
be sure, that they propagated in real life. For example:
mount --bind --make-slave /share /slave1
mount --bind --make-slave /share /slave2
mount /share/test
umount /slave2/test
mount --make-share /slave1/test
mount --bind --make-share /slave1/test /slave2/test
41 40 0:33 / /share rw,relatime shared:28 - tmpfs xxx rw
42 40 0:33 / /slave1 rw,relatime master:28 - tmpfs xxx rw
43 40 0:33 / /slave2 rw,relatime master:28 - tmpfs xxx rw
44 41 0:34 / /share/test rw,relatime shared:29 - tmpfs xxx rw
46 42 0:34 / /slave1/test rw,relatime shared:30 master:29 - tmpfs xxx rw
45 43 0:34 / /slave2/test rw,relatime shared:30 master:29 - tmpfs xxx rw
/slave1/test and /slave2/test depend on each other and minimum one of them
doesn't propagate from /share/test
v2: use false and true for bool
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Try to restore mounts while a postpone list isn't empty and check
that each iteration has some progress, otherwice it will fails for
preventing infinite loops
v2: rework logic about postpone list
add more comments
v3: one more attempt to make it more readable
v4: Here is a master class from Pavel how to write self-documented code.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All shared mounts from one group are connected to circular list.
All slave are added into the proper master list.
v2: change variable name and fix a bug about adding shared mounts in a
circular list.
v3: handle errors of collect_shared
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
mnt_entry contains a few strings and they should be release too
CID 996198 (#4 of 4): Resource leak (RESOURCE_LEAK)
20. leaked_storage: Variable "pm" going out of scope leaks the storage
it points to.
CID 996190 (#1 of 1): Resource leak (RESOURCE_LEAK)
13. leaked_storage: Variable "new" going out of scope leaks the storage
it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Right now when we collect list of vmas we need to know the
number of elements in it. In the future I will need to know
more, so it makes sense to create a vmas-list object for it.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
They are really depends on CPU we're running on.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
No need to walk up the directories if we need
to include protobuf file. This was always a bad
use of ability to walk the filesystem from other
headers.
Same time we don't need -I$(SRC_DIR)/protobuf/
in general makefile anymore.
[xemul: Small fixlet in head Makefile, since patch
it out-of-order]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>