We have a set of routines that open /proc/$pid files via proc service
descriptor. Teach them to accept non-pids as pids to open /proc/self/*
and /proc/* files via the same engine.
Signed-f-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This fixes the support for fifo-s in mount namespaces and
makes it easier to control the correct open_path() usage in
the future.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore find out in which sets tasks live in and move
them there.
Optimization note -- move tasks into cgroups _before_ fork
kids to make them inherit cgroups if required. This saves
a lot of time.
Accessibility note -- when moving tasks into cgroups don't
search for existing host mounts (they may be not available)
and don't mount temporary ones (may be impossible due to
user namespaces). Instead introduce service fd with a yard
of mounts.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Each task points to a single ID of cgroup-set it lives in. This
is done so to save some space in the image, as tasks likely
live in the same set of cgroups.
Other than this we keep track of what cgroup set we dump the
subtree from. If it happens, that root task lives in the same
cgroup set as criu does, we don't allow for any other sub-cgroups
and make restore (next patch) much simpler and faster.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The exact structure of the image will be revealed in the
next patch(es). What is important here, is that cgroup
image is somewhat new.
It will likely contain arrays of objects of different types,
so I introduce the "header" object, that will link these
arrays using pb repeated fields. This will help us to avoid
many image files for different cgroup objects and will make
the amount of write()-s required be 1.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we build vDSO handling code for all archs provided
in the source code having some "common" parts inside pie/vdso.c,
pie/vdso-stub.c, vdso-stub.c and vdso.c. This were more or
less well but in new linux kernels (starting from 3.16 presumably)
the vDSO has been significantly reworked so every architecture
must have own vDSO handling engine (just like the kernel does).
So in this patch we move vDSO code to arch specific and because
aarch64 actually doesn't implement proxification yet due to
kernel restrictions -- we drops it out. When there will be
kernel support we bring it back in proper arch/aarch64
implementation.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Guard vDSO code with CONFIG_VDSO, no need to even build it
on archs which do not support vDSO handling.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
vmsplice doesn't work for such VMA-s.
This flags is set in a kernel function remap_pfn_range()
(remap kernel memory to userspace), which is widely used by device
drivers to provide direct access to a device memory.
Reported-by: J F <jgmb45@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preserve the dumpable flag, which affects whether a core dump will be
generated, but also affects the ownership of the virtual files under
/proc/$pid after restoring a process.
Tested: Restored a process with a criu including this patch and looked
at /proc/$pid to confirm that the virtual files were no longer all owned
by root:root.
zdtm tests pass except for cow01 which seems to be broken.
(see https://bugzilla.openvz.org/show_bug.cgi?id=2967 for details.)
This patch fixes https://bugzilla.openvz.org/show_bug.cgi?id=2968
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Change-Id: I8c386508448a84368a86666f2d7500b252a78bbf
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's possible that a procfs mounted somewhere other than /proc
is in use.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Nowadays this routine is mainly used for getting an
fd, rather than keeping one for future reference.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we don't know mnt_id, we don't know to which namespace a file
belongs.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch removes the global mntinfo_tree and collect_mount_info where
it was constructed. The mntinfo list is filled from dump_mnt_ns,
rst_collect_local_mntns, collect_mnt_namespaces and read_mnt_ns_img.
A mountinfo entry contains a reference on a proper ns_id entry, so
we cau use mnt_id to look up a proper mount namespace.
v2: remove trash after rebasing.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Kernels before 3.15 doesn't show mnt_id and mnt_id isn't saved in
images, if mntns isn't dumped.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore all mount namespaces are restored in the root mntns and
sub-namecpeaces are restored in temorary places.
This function allows to get paths to these places.
It will be used in open_remap_ghost(), because it's called in the root
task, when other tasks are not forked yet.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mntns, so the global mntinfo_tree
variable are useless and information about tree should be connected
to a proper namespace.
But when we don't dump mntns, we need to collect mounts for the current
mntns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mount namespaces and each NS has own
tree. The mount tree is used for checking that a file is reachable.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
One device can be mounted a few times, so files are identical only,
if they have the same mnt_id.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
because we want to check, that all files are reachable.
For that we need to collect all mounts from all namespaces.
v2: dump mntns separately
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now we supports sub-mntns, so root_ns_mask sounds more correct than
current_ns_mask.
v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: another attempt to write readable code:)
v3: clean up
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Known issue:
* currently only namespaces with the same root is supported
* nested namespaces can be dumped and restored only if the root task
has own mount namespace.
All nested namespaces are restored in a root namespace in temporary
directories. All mount points restored in one tree and then they are
divided into namesaces.
The task with minimal pid for each namespaces unshared mntns and
then it makes pivot_root in a proper temporary directory. All other
tasks makes setns to enter into a mount namespace of the task with
minimal pid.
v2: clean up
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
All non-root namespaces will be restored as sub-trees of the root tree.
This patch adds helpers to create a temporary directory and mount tmpfs
in it, then create directories for each non-root mount namespace.
tmpfs is quite useful here to simplify destroying this construction,
we don't need to unmount each namespace separately.
v2: add a comment why MNT_DETACH is not dangerous here
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we'll restore nested mount namespaces, all but root ones (sub-namespaces)
will be restored as sub-mounts in the root mount namespace. So mi->mountpoint
will be not '/' even if a mount is root for its mntns.
v2: s/is_root/is_ns_root/
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently ns_ids list is filled only on dump. Soon we'll need this
list for mount namespaces on restore, e.g. to know which tasks share
the namespaces.
v2: merge the patch "namespace: add a function to search an ns_id
item by id" into this one.
v3: add prefix rst_ to add_ns_id
v4: look up namespace by two values -- type AND ID
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a race. Consider we have 3 tasks, A, B and C. A and B
share fdtable, C -- does not. Then we might be in a situation
when A is restoring memory reading mem images, and B -- forking
the C child. In that case descriptors held by A (for mem restore)
will be inherited by C and will not get closed.
This reverts commit d36e07aabe073993d8ae9695e33f6e45b2eb6a21.
BTRFS returns subvolume dev-id instead of superblock dev-id,
so we need to know which mounts are btrfs.
The mi->fstype->name is "unsuppoerted" here, because the fstype->code
is saved in an image
{
.name = "unsupported",
.code = FSTYPE__UNSUPPORTED,
},
{
.name = "btrfs",
.code = FSTYPE__UNSUPPORTED,
}
An a second reason is that pocesses can be migrated from smth to btrfs.
This all can happen _only_ for the root mount and for bind mounts of
the root mount...
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"relative path" is absolute path with dot at the beginning.
We already use relative paths on restore. In this patch we add "."
on dump too. It's convinient, because we needed to add dot each time
when we want to access this mount point.
Before this patch we had to created a temporary copy.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's already used for dumping files and it will be used for restoring,
so it should be service fd to avoid intersection with restored
descriptors.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This as well gives us minus one image per-task and
allocates more space on core task entry.
One thing to note -- the amount of posix timers is
not easily accessible at the core entry allocation
time, so the respective array is allocated on demand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This allows to have one image less per-task, which in turn
reduces live migration time a little bit.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used for restoring files from proper mounts.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>