We need it to be able to dump signals into cores
before calling parasite_infect_seized().
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Needed for future user namespace support. Capabilities will have to be
dumped from the parasite, i.e. from inside the namespace, since there is no
obvious way to 'translate' capabilities from the global namespace (unlike
with uids and gids, where the id mappings can be used for translation).
[ additional explanation from Andrew Vagin:
"capabilities" are not translated between namespaces. They can exist
only in one userns, where a process lives. If a process is created in a
new userns, it gets a full set of capabilities in this userns, and
loses all caps in a parent userns.
So if capabilities are not shown in /proc/pid/stat, we have no way to
get them except by using parasite code. ]
Signed-off-by: Sophie Blee-Goldman <ableegoldman@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For that, mnt namespaces should be dumped after files.
v2: rework enumeration of namespaces in dump_mnt_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
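
A minimal sketch of the two parsing steps described above, written in
Python for brevity rather than in CRIU's C. The function names are
hypothetical and are not taken from the CRIU sources; only the file
formats of /proc/cgroups and /proc/<pid>/cgroup are assumed.

```python
from collections import defaultdict


def comounted_controllers(proc_cgroups_text):
    """Group controller names by hierarchy ID, as listed in /proc/cgroups.

    Controllers that share a hierarchy ID are mounted together.
    """
    by_hierarchy = defaultdict(list)
    for line in proc_cgroups_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header line and blank lines
        name, hierarchy, _num_cgroups, _enabled = line.split()
        if hierarchy == "0":
            # Not attached to any cgroup-v1 hierarchy; this sketch just
            # skips such controllers instead of mounting them temporarily.
            continue
        by_hierarchy[int(hierarchy)].append(name)
    return dict(by_hierarchy)


def task_cgroups(proc_pid_cgroup_text):
    """Parse 'hierarchy-ID:controller-list:path' lines for one task."""
    result = []
    for line in proc_pid_cgroup_text.splitlines():
        if not line:
            continue
        hierarchy, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        result.append((int(hierarchy), names, path))
    return result
```

Each path returned by task_cgroups() would then be the starting point
for the traversal that looks for child cgroups.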
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The implementation is pretty straightforward. When dumping per-thread
misc data with parasite, collect one, then write in thread_core_info.
On restore wait for creds restore and put the value back (some creds
changes drop it to zero).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Robust lists may be disabled, for example if the "futex_cmpxchg_enabled"
variable in the kernel is unset.
Detect that case by checking that both "get_robust_list" and "set_robust_list"
syscalls return ENOSYS and do not make criu dump fail in that case, but simply
assume an empty list, which is consistent with the syscalls not being
available.
Tested: Successfully ran the zdtm test suite on a kernel where the
"get_robust_list" and "set_robust_list" syscalls are disabled.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The vvar zone is mapped by the kernel and must never
be dumped into the image; the data present there is
valid on the running kernel only.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Each task points to the single ID of the cgroup set it lives in. This
is done to save some space in the image, as tasks are likely to
live in the same set of cgroups.
Other than this, we keep track of which cgroup set we dump the
subtree from. If it happens that the root task lives in the same
cgroup set as criu does, we don't allow any other sub-cgroups,
which makes restore (next patch) much simpler and faster.
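
The ID-per-set idea can be sketched as follows. This is an illustration
in Python, not the CRIU code: the CgSetTable class and the
(controller, path) representation of a membership are assumptions made
for the example.

```python
class CgSetTable:
    """Hand out one small integer ID per distinct set of cgroups."""

    def __init__(self):
        self._ids = {}

    def get_id(self, memberships):
        # memberships: iterable of (controller, path) pairs for one task
        key = frozenset(memberships)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def __len__(self):
        return len(self._ids)


table = CgSetTable()
root_task = table.get_id([("cpu", "/ct1"), ("memory", "/ct1")])
child_task = table.get_id([("memory", "/ct1"), ("cpu", "/ct1")])
other_task = table.get_id([("cpu", "/ct1/sub"), ("memory", "/ct1")])
```

Tasks that live in the same cgroups get the same ID, so the set itself
only has to be stored once, however many tasks refer to it.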
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Preserve the dumpable flag, which affects whether a core dump will be
generated, but also affects the ownership of the virtual files under
/proc/$pid after restoring a process.
Tested: Restored a process with a criu including this patch and looked
at /proc/$pid to confirm that the virtual files were no longer all owned
by root:root.
zdtm tests pass except for cow01 which seems to be broken.
(see https://bugzilla.openvz.org/show_bug.cgi?id=2967 for details.)
This patch fixes https://bugzilla.openvz.org/show_bug.cgi?id=2968
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Change-Id: I8c386508448a84368a86666f2d7500b252a78bbf
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch removes the global mntinfo_tree and collect_mount_info where
it was constructed. The mntinfo list is filled from dump_mnt_ns,
rst_collect_local_mntns, collect_mnt_namespaces and read_mnt_ns_img.
A mountinfo entry contains a reference on a proper ns_id entry, so
we can use mnt_id to look up a proper mount namespace.
v2: remove trash after rebasing.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mntns, so the global mntinfo_tree
variable is useless and information about the tree should be connected
to a proper namespace.
But when we don't dump mntns, we need to collect mounts for the current
mntns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mount namespaces, so files can be opened
from more than one namespace and a root must be collected for each file.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
One device can be mounted several times, so files are identical only
if they have the same mnt_id.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
because we want to check that all files are reachable.
For that we need to collect all mounts from all namespaces.
v2: dump mntns separately
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now we support sub-mntns, so root_ns_mask sounds more correct than
current_ns_mask.
v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This also gives us one image fewer per task and
allocates more space in the core task entry.
One thing to note -- the amount of posix timers is
not easily accessible at the core entry allocation
time, so the respective array is allocated on demand.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This lets us have one image fewer per task, which in turn
reduces live migration time a little bit.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CRIU can handle stopped multithreaded processes when all threads
are stopped. Refine the check to allow this case.
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
An mmaped file is opened O_RDONLY or O_RDWR depending on the permissions
on the first vma dump_task_mm() encounters mapping that file. This
causes two problems:
1. If a file has multiple MAP_SHARED mappings, some of which are
read-only and some of which are read-write, and the first encountered
mapping happens to be read-only, the file will be opened O_RDONLY
during restore, and mmap(PROT_WRITE) will fail with EACCES, causing
the restore to fail.
2. If a file is opened read-write and mapped read-only, it will be
opened O_RDONLY during restore, so restore will succeed, but
mprotect(PROT_WRITE) on the read-only mapping after restore will
fail.
To fix both of these, record open flags per-vma based on the presence of
VM_MAYWRITE in smaps.
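
As a rough illustration of the per-vma decision (a Python sketch, not
the CRIU implementation): the kernel prints VM_MAYWRITE as the "mw"
token on the VmFlags line of /proc/<pid>/smaps. The vma_open_flags()
helper is a hypothetical name used only for this example.

```python
import os


def vma_open_flags(vmflags_line):
    """Pick open(2) flags for one mapping from its smaps VmFlags line."""
    prefix, _, flags = vmflags_line.partition(":")
    if prefix.strip() != "VmFlags":
        raise ValueError("not a VmFlags line: %r" % vmflags_line)
    # "mw" means the mapping may be made writable (VM_MAYWRITE), even if
    # it is currently mapped read-only, so the file must be opened O_RDWR.
    return os.O_RDWR if "mw" in flags.split() else os.O_RDONLY
```

With this rule, a read-only mapping of a file that was opened
read-write still carries "mw" and so gets O_RDWR, which covers the
second problem above.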
Signed-off-by: Jamie Liu <jamieliu@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On PI machine we've got
| CC protobuf.o
| pstree.c: In function ‘core_entry_alloc’:
| pstree.c:36:10: error: ‘RLIM_NLIMITS’ undeclared (first use in this function)
due to old kernel headers. Note I've dropped off
BUG_ON here to localize all things in pstree code,
no need to sprinkle constants.
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The array element is a properly initialized RlimitEntry, so there is
no need for additional memcpy-s and size checks.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We're using new image format, but old image file
is still generated. This will be addressed in
next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Note the restore remains as is for a while, it'll
be addressed later.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In commit 459828b6 I accidentally broke backward
compatibility of auxv vector on 32bit machines.
Bring it back.
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When writing VMAs we perform too many small writes into vma-.img files.
This can be easily fixed by moving the vma-s into mm-s, all the more
so as they cannot be split from each other.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we will read all VmaEntries in one big MmEntry object,
so to avoid copying them all into vma_areas, make them pointable.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The plan is to merge vma images into mm ones (see the following
patches), so prepare the dumping code for that.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When parsing mappings in proc, we fstat the vm file; later,
when dumping it, we stat it again to fill fd_parms.
The 2nd stat is not required, we can keep the stat in
vma_area.
This removed 35% of all stat calls on dump of basic container.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Quite a lot of VMAs in tasks map the same file with different
perms. In that case we may skip opening all these files, but
"borrow" one from the previous VMA parsed.
There's little sense in seeking more than just the previous VMA,
as same files are rarely (can be though) mapped in different
locations.
After this on a basic Centos6 container the number of opens and
stats in this function drops from ~1500 to ~500.
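
The borrowing scheme can be sketched like this (Python, for
illustration only). Identifying a file by a (dev, inode) pair and the
open_file callback are assumptions of the sketch, not details taken
from the patch.

```python
def open_vma_files(vmas, open_file):
    """Return one handle per file-backed VMA, reusing the previous one.

    vmas: list of (dev, inode) keys in address order.
    open_file: callable that actually opens the file for a key.
    """
    handles = []
    prev_key = prev_handle = None
    for key in vmas:
        if key == prev_key:
            handle = prev_handle      # same file as the previous VMA
        else:
            handle = open_file(key)   # only look one VMA back
        handles.append(handle)
        prev_key, prev_handle = key, handle
    return handles


opened = []


def fake_open(key):
    # Stand-in for the real open+stat; records how often it is called.
    opened.append(key)
    return "handle-%d" % len(opened)


handles = open_vma_files([(8, 1), (8, 1), (8, 1), (8, 2), (8, 1)], fake_open)
```

Here five VMAs need only three opens. The last VMA reopens file (8, 1)
because only the immediately preceding VMA is checked.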
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping fsnotifies we may go to irmap to get inode->path
mapping. The irmap engine scans FS (in hinted locations) to
get one and it is slow even though we scan only part of the FS.
Since the above scanning is done while tasks are frozen the
freeze time goes up :(
Improve the situation by generating irmap cache in working dir
at pre-dump when tasks get unfrozen.
The on-disk irmap cache is a PB file, it sits in the -W directory
and can be loaded on dump/pre-dump start in memory. When
resolving the inode->path mapping irmap may meet these entries,
revalidate them and potentially save time.
After pre-dump the (re-)collected irmap data is written back
to the irmap cache image. Typically the entries written back are the
same as those read in on cache load.
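
A minimal sketch of the revalidation step, in Python for brevity. The
check used here (stat the cached path and compare device and inode
numbers) is an assumption about what "revalidate" involves, and the
function name is hypothetical.

```python
import os


def revalidate(cache, dev, ino):
    """Return the cached path for (dev, ino) if it still matches, else None.

    cache: dict mapping (dev, ino) -> path, as loaded from the cache image.
    """
    path = cache.get((dev, ino))
    if path is None:
        return None  # not cached; a real implementation would scan the FS
    try:
        st = os.stat(path)
    except OSError:
        st = None
    if st is None or (st.st_dev, st.st_ino) != (dev, ino):
        # Stale entry: the file is gone or the path now names another inode.
        del cache[(dev, ino)]
        return None
    return path
```

A hit costs a single stat call, which is where the potential time
saving over scanning the hinted locations comes from.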
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We will generate some info about file-descriptors at that
stage. For now these pre-dumped ones would be fsnotifies,
so the pre-dump of a single fd is written as simply as
possible, but enough for that type of FDs pre-dump.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Well, we want to pre-dump files (fsnotifies), for that we
will need mountinfo-s and root, and for the latter -- the
current ns mask.
The problem with current ns mask is that its generation is
incorporated into ns IDs generation and dumping. And since
the ids dumping is not performed on pre-dump, let's just
provide a helper for ns-mask generation.
Strictly speaking, the whole ns-mask idea is not great, but
it's to be fixed later.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Service will call the pre-dump routine, so this is factoring out
enforcing options for CLI and RPC.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
@vma_area_list::longest is in pages not bytes.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
While doing pre-dump we don't do proper VDSO fixups, thus at
this stage we may fail the should_dump_page() checks -- it
will treat VDSO pages as 'regular' and may skip dumping some
of them.
This is not bad as is, but the subsequent dump will properly
spot the VDSO area and will try to dump _all_ pages from it. And
if checks for soft-dirty report that some pages are clean,
dump will try to locate those in parent images and will fail.
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's a feature of PTRACE_SEIZE. So we need to do something only
if we want to change the state.
[xemul: If a task _was_ in stopped state before dump and we want it
to stay alive after dump, the existing code queues one more STOP
to it. This affects subsequent dump, as we seize a stopped task
with STOP in queue.
One more item in TODO list -- support stopped tasks with STOP in
queue :)
]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>