The vvar zone is mapped by the kernel and must never
be dumped into the image: the data present there is
valid on the running kernel only.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Will need it to handle vvar zones in a special way.
Because VMA_UNSUPP never goes into the image file,
let's reuse bit 12 for VVAR.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The /dev directory is also created by zdtm when running ns/ enabled tests.
Add it to the list, together with entries such as /bin and /lib.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This confirms that the fix to handle dumpable flag set to 2 still works after
restore.
To force dumpable flag set to 0 or 2 (whatever the fs.suid_dumpable is set to),
chmod the test binary to 0111 (executable, but not readable) and execv() it
while running as non-root. The kernel will unset the dumpable flag to prevent
a core dump or ptrace from giving the user access to the pages of the binary
(which are supposedly not readable by that user).
Tested:
- # test/zdtm.sh static/dumpable02
Test: zdtm/live/static/dumpable02, Result: PASS
- # test/zdtm.sh ns/static/dumpable02
Test: zdtm/live/static/dumpable02, Result: PASS
- Used -DDEBUG to confirm the value of the dumpable flag was 0 or 2 to match
the fs.suid_dumpable sysctl in the tests (both in and out of namespaces.)
- Confirmed that the test fails if the commit that fixes handling of dumpable
flag with value 2 is reverted and the fs.suid_dumpable sysctl is set to 2.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Commit d5bb7e9748 started to preserve the dumpable flag across migration by
using prctl to get the value on dump and set it back on restore.
In some situations, the dumpable flag can be set to 2. This happens when it is
not reset (with prctl) after using setuid() or after using execv() on a binary
that has executable but not read permissions, when the fs.suid_dumpable sysctl
is also set to 2. However, it is not possible to set it to 2 using prctl,
which would make criu restore fail.
Fix this by checking for the value before passing it to prctl. In case the
value of the dumpable flag was 2 at the source, check whether it is already 2
at the destination, which is likely to happen if the fs.suid_dumpable sysctl is
also set to 2 where restore is running. In that case, preserve the value,
otherwise reset it to 0 which is the most secure fallback.
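
The decision described above can be sketched roughly as follows. This is
only an illustration, not the code from the patch: the helper names
pick_dumpable() and restore_dumpable() are made up, and the sketch assumes
it runs inside the task being restored.

```c
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: given the value saved at dump time and the value
 * the task has right now, return the value to pass to PR_SET_DUMPABLE,
 * or -1 when the current value should be left untouched.
 */
int pick_dumpable(int saved, int current)
{
	if (saved == 0 || saved == 1)
		return saved;	/* prctl accepts these directly */
	if (saved == 2 && current == 2)
		return -1;	/* already 2 at the destination, preserve it */
	return 0;		/* most secure fallback */
}

/* Hypothetical wrapper applying the decision to the calling task. */
int restore_dumpable(int saved)
{
	int current = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
	int target;

	if (current < 0) {
		perror("prctl(PR_GET_DUMPABLE)");
		return -1;
	}

	target = pick_dumpable(saved, current);
	if (target < 0)
		return 0;

	if (prctl(PR_SET_DUMPABLE, target, 0, 0, 0) < 0) {
		perror("prctl(PR_SET_DUMPABLE)");
		return -1;
	}
	return 0;
}
```

Keeping the decision in a separate pure function makes it easy to check
all source/destination combinations without changing the flag of a
real process.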
Fixes: d5bb7e9748
Tested:
- Using dumpable02 zdtm test after setting fs.suid_dumpable to 2.
# sysctl -w fs.suid_dumpable=2
# test/zdtm.sh ns/static/dumpable02
4: DEBUG: before dump: dumpable=2
4: DEBUG: after restore: dumpable=2
4: PASS
Test: zdtm/live/static/dumpable02, Result: PASS
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This confirms that the fix in commit d5bb7e9748 to preserve the dumpable flag
after migration is working as expected.
In this test case, the dumpable flag is expected to always be set to 1, as
test_init will use prctl to reset it to 1 after using setuid and setgid.
Tested:
- # test/zdtm.sh static/dumpable01
Test: zdtm/live/static/dumpable01, Result: PASS
- # test/zdtm.sh ns/static/dumpable01
Test: zdtm/live/static/dumpable01, Result: PASS
- Confirmed that the test fails after reverting commit d5bb7e9748.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Otherwise I see on 3.16-rc1 and higher
| [ 100.851730] futex wrote to ns_last_pid when file position was not 0!
| This will not be supported in the future. To silence this
| warning, set kernel.sysctl_writes_strict = -1
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Next achievement -- external bind mounts and tasks-to-cgroups
bindings. Plus many bugfixes in memory restore and mountpoints
dump, many thanks to Google guys for reports and patches!
We have quite a few things left to make workable LXC and Docker
support, hopefully the next tag will be the 1.3 one :)
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
[ xemul: It's a temporary workaround not to lock the -rc2 release.
Once we have some better solution, this will be rolled back. ]
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A file system can be bind-mounted a few times and some of these mounts
can be non-root. We need to find one of root mounts and dump it.
v2: don't forget to check pm->dumped and pm->parent
don't dump a root file system, it's always external for now.
Reported-by: Saied Kazemi <saied@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
One file system can be mounted a few times, so mnt_id isn't unique for it.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On dump one uses one or more --ext-mount-map options with A:B arguments.
A denotes a mountpoint (as seen from the target mount namespace) criu
dumps and B is the string that will be written into the image file
instead of the mountpoint's root.
On restore one uses the same --ext-mount-map option(s) with similar
A:B arguments, but this time criu treats A as string from the image's
root field (foobar in the example above) and B as the path in criu's
mount namespace that should be bind mounted into the mountpoint.
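
As an illustration of the A:B format only, here is a minimal sketch of
splitting one argument at the colon. The struct and function names are
hypothetical and this is not the parser criu uses. Splitting at the first
colon is also an assumption, so an A that itself contains a colon would
not be handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed A:B pair. */
struct ext_mount_pair {
	char *key;	/* A */
	char *val;	/* B, points into the same buffer as key */
};

/*
 * Split "A:B" at the first colon. Both halves must be non-empty.
 * Returns 0 on success, -1 on malformed input or allocation failure.
 * On success the caller releases the pair with free(pair->key).
 */
int parse_ext_mount_arg(const char *arg, struct ext_mount_pair *pair)
{
	size_t len = strlen(arg);
	char *copy, *colon;

	copy = malloc(len + 1);
	if (!copy)
		return -1;
	memcpy(copy, arg, len + 1);

	colon = strchr(copy, ':');
	if (!colon || colon == copy || colon[1] == '\0') {
		free(copy);
		return -1;
	}

	*colon = '\0';
	pair->key = copy;
	pair->val = colon + 1;
	return 0;
}
```

The same helper would serve both directions: on dump the key is the
mountpoint and the value is the string stored in the image, on restore
the key is that stored string and the value is the path to bind mount.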
v3:
* Added documentation
* Added RPC bits
* Changed option name into --ext-mount-map
* Use colon as key and value separator
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These are mounted by default in ubuntu containers, so criu should know about
them and remount them on restore.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It uses absolute file names, so any open-s should happen _before_
we change tasks' root.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
If fchroot() succeeds, further failures are not
noticed by the caller.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
There's no such thing as fchroot() in Linux, but we need to
chroot() into an existing file descriptor. Before this patch we did
this by chroot()-ing into /proc/self/fd/$fd. Without proc mounted this
is no longer possible, so do it like this:
fchdir(proc_service_fd);
chroot("./self/fd/$root_fd");
fchdir($cwd_fd);
Thanks to Andrey Vagin for this trick ;)
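
A slightly fuller sketch of the three calls above, with every step
checked, might look like this. The function names are hypothetical, and
the descriptors are assumed to have been opened by the caller beforehand
(proc_fd on a proc mount, cwd_fd on the directory to return to).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for chroot() and fchdir() on glibc */
#endif
#include <stdio.h>
#include <unistd.h>

/* Build "./self/fd/<fd>", relative to a proc mount directory. */
int fd_relative_path(char *buf, size_t size, int fd)
{
	int n = snprintf(buf, size, "./self/fd/%d", fd);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Emulate fchroot(root_fd) without an absolute /proc path. */
int fchroot_via_proc(int proc_fd, int root_fd, int cwd_fd)
{
	char path[64];

	if (fd_relative_path(path, sizeof(path), root_fd))
		return -1;

	if (fchdir(proc_fd)) {
		perror("fchdir(proc)");
		return -1;
	}

	if (chroot(path)) {
		perror("chroot");
		/* Best effort: do not leave the caller inside proc. */
		if (fchdir(cwd_fd))
			perror("fchdir(cwd)");
		return -1;
	}

	if (fchdir(cwd_fd)) {
		perror("fchdir(cwd)");
		return -1;
	}

	return 0;
}
```

The actual chroot() call needs the usual privileges, so only the path
helper and the error path can be exercised as an ordinary user.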
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We have a set of routines that open /proc/$pid files via proc service
descriptor. Teach them to accept non-pids as pids to open /proc/self/*
and /proc/* files via the same engine.
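
One way to picture this is a pair of reserved "pid" values that select
the path prefix. The constant names and values below are assumptions
made for the sketch; the ones in the patch may differ.

```c
#include <stdio.h>

/* Hypothetical pseudo-pid values. */
#define PROC_PID_SELF	0	/* open /proc/self/<file> */
#define PROC_PID_NONE	(-1)	/* open /proc/<file>      */

/*
 * Build the path of a proc file relative to the proc service
 * descriptor, ready to be passed to openat(proc_fd, buf, flags).
 * Returns 0 on success, -1 on a bad pid or a too small buffer.
 */
int proc_rel_path(char *buf, size_t size, int pid, const char *file)
{
	int n;

	if (pid == PROC_PID_SELF)
		n = snprintf(buf, size, "self/%s", file);
	else if (pid == PROC_PID_NONE)
		n = snprintf(buf, size, "%s", file);
	else if (pid > 0)
		n = snprintf(buf, size, "%d/%s", pid, file);
	else
		return -1;

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With this shape the existing open helpers keep a single code path and
only the prefix selection changes.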
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When running a test with the ns/ prefix, zdtm.sh does complex preparations.
Make it possible to perform them and leave the started process ready for
manual investigation.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
New vDSOs are in stripped format, so use dynamic
symbols instead of sectioned ones.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We're not sharing the code anymore so drop it.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This fixes the support for fifo-s in mount namespaces and
makes it easier to control the correct open_path() usage in
the future.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
I will need to make cgroup test behave slightly differently
when it's in and out of ns/ run. To do so it's handy to use
the ZDTM_NEWNS variable set by zdtm.sh
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch consists of 3 unsplittable (from my POV) fixes.
1. Remove messy check from dump_one_mountpoint() -- we have
validate_mounts to check whether we can dump the tree
or not.
2. Other than being in the wrong place the mentioned check
is wrong. Comparing of the length of the mp->source-s
makes no sense -- it should be mp->root, but even this
would be wrong...
3. ... instead, we should check for bind mount root path
being accessible from the target mount root path, i.e.
the bind->root should start with src->root.
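
The check from item 3 can be sketched as a plain path prefix test. The
function name is hypothetical, and stopping the match at a path component
boundary (so that "/ab" is not treated as being under "/a") is an
assumption of this sketch rather than something the patch states.

```c
#include <stdbool.h>
#include <string.h>

/*
 * True when bind_root is src_root itself or lies below it,
 * i.e. bind_root starts with src_root on a component boundary.
 */
bool root_is_reachable(const char *src_root, const char *bind_root)
{
	size_t len = strlen(src_root);

	if (strncmp(bind_root, src_root, len))
		return false;

	/* src_root ending in '/' (e.g. "/") already ends on a boundary. */
	if (len > 0 && src_root[len - 1] == '/')
		return true;

	return bind_root[len] == '\0' || bind_root[len] == '/';
}
```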
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
The memcpy() in devpts_dump() just overwrites part of them.
Fix this and move the whole code into sub-routine for future.
v2: Fix off-by-one error spotted by Filipe.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Filipe Brandenburger <filbranden@google.com>
We use config.h in vDSO handling code so arch
targets should depend on it.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore find out which sets tasks live in and move
them there.
Optimization note -- move tasks into cgroups _before_ forking
kids to make them inherit cgroups if required. This saves
a lot of time.
Accessibility note -- when moving tasks into cgroups don't
search for existing host mounts (they may be not available)
and don't mount temporary ones (may be impossible due to
user namespaces). Instead introduce service fd with a yard
of mounts.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Each task points to a single ID of cgroup-set it lives in. This
is done so to save some space in the image, as tasks likely
live in the same set of cgroups.
Other than this we keep track of what cgroup set we dump the
subtree from. If it happens that the root task lives in the same
cgroup set as criu does, we don't allow any other sub-cgroups,
which makes restore (next patch) much simpler and faster.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The exact structure of the image will be revealed in the
next patch(es). What is important here, is that cgroup
image is somewhat new.
It will likely contain arrays of objects of different types,
so I introduce the "header" object, that will link these
arrays using pb repeated fields. This will help us to avoid
many image files for different cgroup objects and will make
the amount of write()-s required be 1.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we build vDSO handling code for all archs provided
in the source code having some "common" parts inside pie/vdso.c,
pie/vdso-stub.c, vdso-stub.c and vdso.c. This worked more or
less well, but in new Linux kernels (starting from 3.16 presumably)
the vDSO has been significantly reworked, so every architecture
must have its own vDSO handling engine (just like the kernel does).
So in this patch we move the vDSO code into arch-specific code, and because
aarch64 doesn't actually implement proxification yet due to
kernel restrictions, we drop it. Once there is
kernel support we will bring it back in a proper arch/aarch64
implementation.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Guard vDSO code with CONFIG_VDSO, no need to even build it
on archs which do not support vDSO handling.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>