377763e5 is incorrect since we can't always chop off the last element in
the buffer:
Execute static/cgroup00
./cgroup00 --pidfile=cgroup00.pid --outfile=cgroup00.out --dirname=cgroup00.test
Dump 12819
(00.003514) Error (files-reg.c:624): Can't create link remap for /dev/nul. Use link-remap option.
(00.003523) Error (cr-dump.c:1257): Dump files (pid: 12819) failed with -1
(00.004042) Error (cr-dump.c:1619): Dumping FAILED.
WARNING: cgroup00 returned 1 and left running for debug needs
Test: zdtm/live/static/cgroup00, Result: FAIL
==================================== ERROR ====================================
Test: zdtm/live/static/cgroup00, Namespace:
================================= ERROR OVER =================================
Hopefully the >= will appease coverity (instead of just a ==).
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a classical off-by-one error. If sizeof(buf) is 512,
the last element is buf[511] but not buf[512].
Reported by Coverity, CID 114624, 114622 etc.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When in a userns, tasks can't write to certain sysctl files:
(00.009653) 1: Error (sysctl.c:142): Can't open sysctl kernel/hostname: Permission denied
See inline comments for details on affected namespaces.
Mostly for my own education in what is required to port something to be
userns restorable, I ported the sysctl stuff. A potential concern for this
patch is that copying structures with pointers around is kind of gory. I
did it ad-hoc here, but it may be worth inventing some mechanisms to make
it easier, although I'm not sure what exactly that would look like
(potentially re-using some of the protobuf bits; I'll investigate this more
if it looks helpful when doing the cgroup user namespaces port?).
Another issue is that there is not a great way to return non-fd stuff in
memory right now from userns_call; one of the little hacks in this code
would be "simplified" if we invented a way to do this.
v2: coalesce the individual struct sysctl_req requests into one big
sysctl_userns_req that is in a contiguous region of memory so that we
can pass it via userns_call. Hopefully nobody finds my little ascii
diagram too offensive :)
v3: use the fork/setns trick to change the syctl values in the right ns for
IPC/UTS nses; see inline comment for details
v4: only use sysctl_userns_req when actually doing a userns_call.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's required for dumping tmpfs, where we use tar to save content.
If we need to execute tar from a proper userns to get right uid-s and
gid-s for files.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's rudiment. close_old_fds() closes all extra descriptors.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we call open_proc(PROC_SELF, ...) the /proc/self descriptor is
cached in criu. If the process fork()-s after than and child goes
open_proc(PROC_SELF, ...) then it will get the parent's proc descriptor.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
In AArch64, pages may be 4K or 64K depending on kernel configuration.
The GNU C Library documentation suggests [1], "the correct interface
to query about the page size is sysconf". Introduce one new
architecture-specific function-like macro, page_size(), that on x86
and AArch32 remains a constant so as to minimally affect performance,
but on AArch64 is sysconf(_SC_PAGESIZE) for correctness.
1. https://www.gnu.org/software/libc/manual/html_node/Query-Memory-Parameters.html
To minimize churn, the PAGE_SIZE macro is left as a build-time
estimation of what the run-time page size might be.
This fixes the following errors for CRIU on AArch64 kernels with
CONFIG_ARM64_64K_PAGES=y, allowing dump of
`setsid sleep < /dev/null &> /dev/null` to succeed.
Error (kerndat.c:48): Can't stat self map_files: No such file or directory
Error (util.c:668): Can't read pme for pid 90: No such file or directory
Error (parasite-syscall.c:1135): Can't open 89/map_files/0x3ffb7da0000-0x3ffb7dac000 on procfs: No such file or directory
Signed-off-by: Christopher Covington <cov@codeaurora.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Force-read came from very first dev version of CRIU (even before 1.0 release)
and never been used actually in image.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are cases where a process's file descriptor cannot be restored
from the checkpoint images. For example, a pipe file descriptor with
one end in the checkpointed process and the other end in a separate
process (that was not part of the checkpointed process tree) cannot be
restored because after checkpoint the pipe will be broken.
There are also cases where the user wants to use a new file during
restore instead of the original file at checkpoint time. For example,
the user wants to change the log file of a process from /path/to/oldlog
to /path/to/newlog.
In these cases, criu's caller should set up a new file descriptor to be
inherited by the restored process and specify the file descriptor with the
--inherit-fd command line option. The argument of --inherit-fd has the
format fd[%d]:%s, where %d tells criu which of its own file descriptors
to use for restoring the file identified by %s.
As a debugging aid, if the argument has the format debug[%d]:%s, it tells
criu to write out the string after colon to the file descriptor %d. This
can be used, for example, as an easy way to leave a "restore marker"
in the output stream of the process.
It's important to note that inherit fd support breaks applications
that depend on the state of the file descriptor being inherited. So,
consider inherit fd only for specific use cases that you know for sure
won't break the application.
For examples please visit http://criu.org/Category:HOWTO.
v2: Added a check in send_fd_to_self() to avoid closing an inherit fd.
Also, as an extra measure of caution, added checks in the inherit fd
look up functions to make sure that the inherit fd hasn't been reused.
The patch also includes minor cosmetic changes.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This commit is in preparation for the (hopefully last :) restore special cpuset
patch.
Previously, we installed the cgroup service fd after calling
prepare_cgroup_dirs, which meant that we had to carry around the temporary
directory name in order to put things in the right place. The
restore_cgroup_prop function uses the cg service fd instead of carrying around
the full path. This means that we can't sue restore_cgroup_prop, without first
sanitizing the path. Instead, we install the service fd before calling
prepare_cgroup_dirs, and all the code just references that instead of carrying
around the temporary path.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When dumping tasks we do a lot of open_proc()-s and to
speed this up the /proc/pid directory is opened first
and the fd is kept cached. So next open_proc()-s do just
openat(cached_fd, name).
The thing is that we sometimes call open_proc(PROC_SELF)
in between and proc helpers cache the /proc/self too. As
the result we have a bunch of
open(/proc/pid)
close()
open(/proc/self)
close()
see-saw-s in the middle of dumping tasks.
To fix this we may cache the /proc/self separately from
the /proc/pid descriptor. This eliminates quite a lot
of pointless open-s and close-s.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When service starts page server all the preparations (log, wdir, img dir, etc.)
happen in parent task, then we fork page server.
This is OK for now, but when we will serve several requests per connection, all
these resources would be leaked in parent.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We can't dump netlink socket, inotify, fanotify, if they have queued
data, so lets add a function to chech this.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When doing a restore for LXC, we store some other metadata (which bridge a veth
was on) in the image directory so that the restore script can correctly unlock
a network device and attach it to the right interface. This patch is needed so
that the script can find this metadata.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Before we would not detect the mount point for co-mounted controllers. Things
still worked because we'd just re-mount them ourselves and traverse our own
mount point, but this saves an extra mount().
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
During the dump phase, /proc/cgroups is parsed to find co-mounted cgroups.
Then, for each task /proc/self/cgroup is parsed for the cgroups that it is a
member of, and that cgroup is traversed to find any child cgroups which may
also need restoring. Any cgroups not currently mounted will be temporarily
mounted and traversed. All of this information is persisted along with the
original cg_sets, which indicate which cgroups a task is a member of.
On restore, an initial phase creates all the cgroups which were saved. Tasks
are then restored into these cgroups via cg_sets as usual.
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is for debug purpose mostly.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a set of routines that open /proc/$pid files via proc service
descriptor. Teach them to accept non-pids as pids to open /proc/self/*
and /proc/* files via the same engine.
Signed-f-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a race. Consider we have 3 tasks, A, B and C. A and B
share fdtable, C -- does not. Then we might be in a situation
when A is restoring memory reading mem images, and B -- forking
the C child. In that case descriptors held by A (for mem restore)
will be inherited by C and will not get closed.
This reverts commit d36e07aabe073993d8ae9695e33f6e45b2eb6a21.
For all other tasks only unsed service descriptors will be closed.
This change allows to have file descriptors, which may be used for
restoring namespaces. All non-server descriptors must be closed before
restoring files.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Opening /dev/null may fail, check for ret code.
CID 1168167
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The --exec-cmd option specifies a command that will be execvp()-ed on successful
restore. This way the command specified here will become the parent process of
the restored process tree.
Waiting for the restored processes to finish is responsibility of this command.
All service FDs are closed before we call execvp(). Standad output and error of
the command are redirected to the log file when we are restoring through the RPC
service.
This option will be used when restoring LinuX Containers and it seems helpful
for perf or other use cases when restored processes must be supervised by a
parent.
Two directions were researched in order to integrate CRIU and LXC:
1. We tell to CRIU, that after restoring container is should execve()
lxc properly explaining to it that there's a new container hanging
around.
2. We make LXC set himself as child subreaper, then fork() criu and ask
it to detach (-d) from restore container afterwards. Being a subreaper,
it should get the container's init into his child list after it.
The main reason for choosing the first option is that the second one can't work
with the RPC service. If we call restore via the service then criu service will
be the top-most task in the hierarchy and will not be able to reparent the
restore trees to any other task in the system. Calling execve from service
worker sub-task (and daemonizing it) should solve this.
Signed-off-by: Deyan Doychev <deyandoichev@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This reverts commit 66ab5e1ad8ac682c9225446747885184b7bf41b5.
After Andrey's fixes that create mount points before dropping
old mounts and going to pivot_root, this patch is not needed.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We call tar, ip, iptables, etc. when restoring container.
The problem is that these stuff is called from inside new
mount namespace after pivot_root(). But the execvp uses
PATH variable inherited from the host system, which may
not reflect real binaries layout.
Add "/bin" to path as temporary workaround.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
otherwise it won't compile:
util.c: In function ‘cr_daemon’:
util.c:594:8: error: ignoring return value of ‘chdir’, declared
with attribute warn_unused_result [-Werror=unused-result]
chdir("/");
^
Signed-off-by: Tikhomirov Pavel <snorcht@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
On restore we will read all VmaEntries in one big MmEntry object,
so to avoif copying them all into vma_areas, make them be pointable.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The is_foo_link readlinks the lfd to check. This makes
anon-inodes dumping readlink several times to find proper
dump ops. Optimize this thing.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Service shouldn't call client provided scripts, as it
creates a security issue (client may be unpriviledged,
while the service is).
In order to let caller do what it would normally do with
criu-scripts, make criu notify it about scripts. Caller
then do whatever it needs and responds back.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
RPC will start page-server daemon and needs to get the
controll back to report back to caller, but the glibc's
daemon() does exit() in parent context preventing it.
Thus -- introduce own daemonizing routine.
Strictly speaking, this is not pure daemon() clone, as the
parent process has to exit himself. But this is OK for now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In case criu check is run as non-root, a lot of information is printed
to a user, with the only missing bit is it should run it as root.
Fix it.
I still don't like the fact that some other stuff is printed here,
like the timestamp and the __FILE__:__LINE__, but this should be
fixed separately.
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>