If we fail to create temporary directory for doing a clean mount we can
make mount clean reusing the code which enters new mountns to umount
overmounts. As when last process exits mntns all mounts are implicitly
cleaned from children, see in kernel source - sys_exit->do_exit
->exit_task_namespaces->switch_task_namespaces->free_nsproxy
->put_mnt_ns->umount_tree->drop_collected_mounts->umount_tree:
/* Hide the mounts from mnt_mounts */
list_for_each_entry(p, &tmp_list, mnt_list) {
list_del_init(&p->mnt_child);
}
Fixes commit b6cfb1ce29 ("mount: make open_mountpoint handle overmouts
properly")
https://github.com/checkpoint-restore/criu/issues/520
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Create a tree of shared mounts where shared mounts have different sets
of children (while having the same root):
First share1 is mounted shared tmpfs and second share1/child1 is mounted
inside, third share1 is bind-mounted to share2 (now share1 and share2
have the same shared id, but share2 has no child), fourth share1/child2
is bind-mounted from share1, and also propagated to share2/child2 (now
all except share1/child1 have the same shared id), fifth share1/child3
is mounted and propagates inside the share.
Finally we have four mounts shared between each other with different
sets of children mounts, and even more two of them are children of
another two:
495 494 0:62 / /zdtm/static/non_uniform_share_propagation.test/share1 rw,relatime shared:235 - tmpfs share rw
496 495 0:63 / /zdtm/static/non_uniform_share_propagation.test/share1/child1 rw,relatime shared:236 - tmpfs child1 rw
497 494 0:62 / /zdtm/static/non_uniform_share_propagation.test/share2 rw,relatime shared:235 - tmpfs share rw
498 495 0:62 / /zdtm/static/non_uniform_share_propagation.test/share1/child2 rw,relatime shared:235 - tmpfs share rw
499 497 0:62 / /zdtm/static/non_uniform_share_propagation.test/share2/child2 rw,relatime shared:235 - tmpfs share rw
500 495 0:64 / /zdtm/static/non_uniform_share_propagation.test/share1/child3 rw,relatime shared:237 - tmpfs child3 rw
503 497 0:64 / /zdtm/static/non_uniform_share_propagation.test/share2/child3 rw,relatime shared:237 - tmpfs child3 rw
502 499 0:64 / /zdtm/static/non_uniform_share_propagation.test/share2/child2/child3 rw,relatime shared:237 - tmpfs child3 rw
501 498 0:64 / /zdtm/static/non_uniform_share_propagation.test/share1/child2/child3 rw,relatime shared:237 - tmpfs child3 rw
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
These also fixes false-propagation problem of the mount to itself if it
is in parent's share.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
1) redo waiting for parents of propagation group to be mounted using
pre-found propagation groups
2) for shared mount wait for children of that shared group which has no
propagation in our shared mount
(2) - effectively is a support of non-uniform shares, that means two
mounts of shared group can have different sets of children now - we will
mount them in the right order, but propagate_mount and validate_shared
are still preventing c/r-ing such shares, will fix the former and remove
the latter in separate(next) patches.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
These information will help improving the restore of tricky mounts
configurations.
Function same_propagation_group checks if two mounts were created
simultaneousely through shared mount propagation, and the main part of
these - they should be in exaclty the same place inside the share of
their parents.
Function root_path_from_parent prints the mountpoint path
relative to the root of the parent's share, by first substracting
parent's mountpoint from our mountpoint and second prepending parents
root path (relative to the root of it's file system), e.g:
id parent_id root mountpoint
1 0 / /
2 1 / /parent_a
3 1 /dir /parent_b
4 2 / /parent_a/dir/a
5 3 / /parent_b/a
(Let 2 and 3 be a shared group)
For mount 4 root_path_from_parent gives:
"/parent_a/dir/a" - "/parent_a" == "/dir/a"
"/" + "/dir/a" == "/dir/a"
For mount 5:
"/parent_b/a" - "/parent_b" == "/a"
"/dir" + "/a" == "/dir/a"
So mounts 4 and 5 are a propagation group.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We should not use ->bind link for checking master's children. As if we
have two slaves shared between each other, the one mounted first will
replace ->bind link for the other - that will break restore.
Also while on it, if we do not want doubled mounts and want to
prohibit propagation to slaves on restore we likely want all children of
the whole master's share mounted before slave.
JFYI: Actually these restriction is very strict and some cases will fail
to restore, for instance (hope nobody does so):
mkdir /test
mount -t tmpfs test /test
mount --make-private /test
mkdir /test/{share,slave}
mount -t tmpfs share /test/share --make-shared
mount --bind /test/share/ /test/slave/
mount --make-slave /test/slave
mount --make-shared /test/slave
mkdir /test/share/slave
mount --bind /test/slave/ /test/share/slave/
cat /proc/self/mountinfo | grep test
524 612 0:69 / /test rw,relatime - tmpfs test rw
570 524 0:73 / /test/share rw,relatime shared:879 - tmpfs share rw
571 524 0:73 / /test/slave rw,relatime shared:942 master:879 - tmpfs share rw
602 570 0:73 / /test/share/slave rw,relatime shared:942 master:879 - tmpfs share rw
603 571 0:73 / /test/slave/slave rw,relatime shared:943 master:942 - tmpfs share rw
Here 603 is a propagation of 602 from master 570 to slave 571, and it is
the only way to get such a mount as 571 and 602 are in one shared group
now and all later mounts to them will propagate between them and create
dublicated mounts. So to create real 603 without dups we need to have
/test/slave mounted before /test/share/slave, which contradicts with
current assumption.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
See more detailed explanation inside in-code comment.
note: Actually before we remove validate_mounts (later in these
patchset) we likely won't get to these check and fail earlier, as having
children collision implies shared mounts with different sets of
children.
note: from v4.11 and ms kernel commit 1064f874abc0 ("mnt: Tuck mounts
under others instead of creating shadow/side mounts.") there will be no
more mount collision.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
We use fdstore intensively for example when handling
bindmounted sockets and ghost dgram sockets. The system
limit for per-socket queue may not be enough if someone
generate lots of ghost sockets (150 and more as been
detected on default fedora 27).
To make it operatable lets unlimit fdstore queue size
on startup.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When we are dumping epoll and one of target fd is been
duped we can reuse already collected fds rbtree to find
proper target. We handle it in a lazy way:
- try use plain regular bsearch first, in case of all
targets are not duped we checkpoint epoll immediately
- if bsearch failed we put this epoll entry into a queue
and run its dumping later when all other files in the
process are already dumped. At this moment fds tree
should already has all target files in rbtree thus
we can simply lookup for it
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is used in files tree generation so we will need
reuse for epoll sake.
Also use the whole 64 bit offset to shuffle bits more.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
To find target files with help of our collected
rbtree.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If we can't find target file descriptor we should
exit on dump with error instead of skipping it.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will use them to fast lookup of targets files.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
To figure out efd:tfd mapping easier by reading the logs.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When target file obtained from epoll fdinfo (internally the
kernel keeps only file _number_ inside) we have to check its
identity to make sure it is exactly one which has been added
into epoll engine. The only proper way is to use kcmp syscall.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When we are checkpoiting epoll targets we assuming that this target
file is belonging to the process we are on. This is of course not
true. Without kernel support the only thing we can do is compare
fd numbers with ones present in epoll fdinfo. When fd numer match
we assume that it indeed the file which has been added into epoll.
This won't cover the case when file has been moved to some other
number and new one is reopened instead of it. Such scenario will
trigger false positive and we can't do anything about.
In next patches with kernel help we will make precise check for
files identity.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In epoll dumping we will need the whole set of fds to investigate
the targets, so pass this parameter down to epoll code.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will need it to make sure the target files in epolls are present
in current process.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
- switch to use uintX type (just to drop uX finally,
it doesn't worth to carry this type)
- instead of including huge util.h rather include the
files which are really needed: log, xmalloc, compiler
and bug
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Before this patch we used flock to order task creation,
but this way is not good. It took 5 syscalls to synchronize
a creation of a single child:
1)open()
2)flock(LOCK_EX)
3)flock(LOCK_UN)
4)close() in parent
5)close() in child
The patch introduces more effective way for synchronization,
which executes 2 syscalls only. We use last_pid_mutex,
and the syscalls number sounds definitely better.
v2: Don't use flock() at all
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
I think, we should warn a user when we can't C/R compatible
applications. That's valid for different than x86 archs.
Let's correct the message the way it'll suit non-x86.
Reported-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When a non-root user runs "criu restore" and criu has the suid bit,
a process will run with non-zero uid and gid.
Before the 4.13 kernel (4d28df6152aa "prctl: Allow local CAP_SYS_ADMIN
changing exe_file"), PR_SET_MM_EXE_FILE fails if uid or gid isn't zero.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Install sudo, create test user with ID 1000, install bash,
fix pidfile creation and pidfile chmod.
v2:
* use sleep to give the criu daemon some time to start up
v3:
* Andrei is of course right and sleep is not good solution.
After adding --status-fd support to criu service, this
is how we now detect that criu is ready.
v4:
* This was much more complicated than expected which is related
to the different versions of the tools on the different travis
test targets. There seems to be a bug in bash on Ubuntu
https://lists.gnu.org/archive/html/bug-bash/2017-07/msg00039.html
which prevents using 'read -n1' on Ubuntu. As a workaround
the result from CRIU's status FD is now read via python.
Another problem was discovered on alpine with the loop restore test.
CRIU says to use setsid even if the process is already using setsid.
As a workaround, still with setsid, this process is now using
shell-job true for checkpoint and restore.
Parts of v2 have been committed before. So the changes from this commit
are partially already in another commit.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Make the --status-fd option also work in 'criu service' mode to avoid
race conditions during testing.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Install sudo, create test user with ID 1000, install bash,
fix pidfile creation and pidfile chmod.
v2:
* use sleep to give the criu daemon some time to start up
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This extends the test.py to also run the RPC command VERSION via 'criu
service'. It was already running using 'criu swrk'.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In this directory there are various test cases using CRIU in RPC mode
(or SWRK mode).
This fixes the broken tests by moving the start of 'criu service' from
run.sh to the Makefile as the test cases is running using "sudo -g
'#1000' -u '#1000'" and the PID file created by CRIU can only be read by
the root user. If starting the 'criu service' before run.sh the PID file
still can be changed to 0666 and fixing the test script.
This also adds version.py to the test cases that are executed.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Return an error if we meet unexpected parameters in a config file
Cc: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Now we rely on scanf, that it will initializes a pointer to NULL, when
it fails to parse a string, but I can't find in a man page, that it has
to do this.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>