We close it in sigreturn_restore() for unification with other
service fds, so kill the second close() from here.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This minimizes the chance of hitting the problem where files
used for page transfer try to use the same number that is
reserved for a service fd.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will need it to lift the limit on file allocation when
reserving service fds and, later, when running parasite code
(which is implemented in the vz7 instance and will soon be
ported into vanilla).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Add a fake fd type for autofs. This allows functions
like find_file_desc() to work as expected, without
having two different file_desc with the same type
and same id.
Also, later, it will allow us to delete autofs_create_fle()
and use a generic helper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In case we have mounts:
1 /mnt/
2 /mnt/a with parent 1
3 /mnt/a/b with parent 1
4 /mnt/a with parent 2
We determine with does_mnt_overmount() that 2 needs a remap and remap it.
Next we mount 4 on top of 2. Then in fixup_remap_mounts() we want to
move 2 back to its parent 1, but move 4 there instead. So in this case
children-overmounts need to be remapped too.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Remaps in mnt_remap_list should follow the same descending order that was
set up in mnt_resort_siblings(), so don't reorder them.
For instance if we have sibling mounts with mountpoints:
1) /dir1/dir2/dir3
2) /dir1/dir2
3) /dir1
Here (2) is sibling-overmount for (1). Mount (3) is sibling-overmount
for both (1) and (2). So when we move overmounts back in
fixup_remap_mounts() we should first move (2) and only then (3).
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We should add a new entry _before_ the first entry with a smaller depth
to keep the list sorted in descending order.
E.g. if the entries in the list have depths [7,5,3] and we add a new
entry m with depth 4, we break out of the list_for_each_entry loop on p
with depth 3. Before this patch, list_add would then give [7,5,3,4],
which is wrong.
Also, we can relax the "<=" check to "<" to avoid unnecessary reordering.
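The insertion rule above can be sketched in isolation. This is a minimal
illustration, not the CRIU code: struct entry and insert_desc() are
hypothetical names, and a plain singly linked list stands in for the
kernel-style list_head used by list_for_each_entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount entry; only the depth matters here. */
struct entry {
	int depth;
	struct entry *next;
};

/*
 * Insert m so that the list stays sorted by depth in descending order.
 * m is placed before the first entry whose depth is strictly smaller
 * ("<", not "<="), so entries of equal depth keep their insertion order.
 * Returns the (possibly new) head of the list.
 */
static struct entry *insert_desc(struct entry *head, struct entry *m)
{
	struct entry **pp = &head;

	/* Skip every entry that is at least as deep as m. */
	while (*pp && (*pp)->depth >= m->depth)
		pp = &(*pp)->next;

	/* Link m in front of the first shallower entry (or at the tail). */
	m->next = *pp;
	*pp = m;
	return head;
}
```

With depths 7, 5 and 3 already in the list, inserting depth 4 this way
places it between 5 and 3 rather than after 3.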
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Dump of a VZ7 CT fails if it has an overmounted tmpfs inside:
[root@silo ~]# prlctl enter su-test-2
entered into CT
CT-829e7b28 /# mkdir /mnt/overmntedtmp
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt/overmntedtmp/
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt
CT-829e7b28 /# logout
[root@silo ~]# prlctl suspend su-test-2
Suspending the CT...
Failed to suspend the CT: PRL_ERR_VZCTL_OPERATION_FAILED (Details: Will skip in-flight TCP connections
(01.657913) Error (criu/mount.c:1202): mnt: Can't open ./mnt/overmntedtmp: No such file or directory
(01.662528) Error (criu/util.c:709): exited, status=1
(01.664329) Error (criu/util.c:709): exited, status=1
(01.664694) Error (criu/cr-dump.c:2005): Dumping FAILED.
Failed to checkpoint the Container
All dump files and logs were saved to /vz/private/829e7b28-f204-4bce-b09f-d203b99befd4/dump/Dump.fail
Checkpointing failed
)
CRIU wants to dump the contents of the /mnt/overmntedtmp/ mount, but it
is unavailable. So in such a case we copy the mount namespace and unmount
the overmounts to access what we want to dump.
The actual use case here is dumping a CT with active mariadb and an ssh
connection. Together they happen to create such an overmount: by default
systemd creates a separate mount namespace for mysql and also mounts
tmpfs to /run/user in it, and when ssh (root) is connected, systemd also
mounts tmpfs to /run/user/0 in the container root mount namespace for
user files. As /run is a slave mount, /run/user/0 also propagates to
mysql's mount namespace and initially becomes overmounted by /run/user.
https://jira.sw.ru/browse/PSBM-57362
remove __maybe_unused for mnt_is_overmounted and umount_overmounts
changes in v2:
1) Use clone instead of fork, and share resources with the parent the
same way as in call_in_child_process.
2) Do not enter userns (create helper) for non-overmounted mounts. Thus
bring back the setns/restore_ns logic.
3) Helper opens the fd for the parent directly thanks to CLONE_FILES;
remove futex.
4) Check helper exit status properly.
5) Add get_clean_fd helper.
6) Add better comments.
changes in v3:
1) Pass fd from helper through args instead of ret code, fix ret code
checking.
2) Add \n to pr_err in open_mountpoint
changes in v5:
Make comments even better.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
also remove __maybe_unused for __umount_children_overmounts
note: leave it __maybe_unused yet
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The original idea was to set --empty net for criu dump and criu restore,
but before cde33dcb06 ("empty-ns: Don't C/R iptables too (v2)"),
criu restore worked without --empty net and we didn't notice that
docker doesn't set this option on restore.
After a small brainstorm, we decided that it is better to remove
this requirement. Docker still has to set this option, but with these
changes the docker issue will be less urgent.
https://github.com/checkpoint-restore/criu/issues/393
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There's an
	if (bad_thing) {
		ret = -1;
		break;
	}
block above this hunk, whose intention is to propagate -1 back to
the caller. This propagation is obviously broken.
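As an illustration of the intended behaviour only: the message does not
say how the propagation was broken, so the sketch below assumes the
common case where the value set before "break" must survive until the
function returns. bad_thing() and process_all() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item check: a negative value counts as the bad thing. */
static int bad_thing(int item)
{
	return item < 0;
}

/* Walk the items; return 0 on success or -1 as soon as one item is bad. */
static int process_all(const int *items, size_t n)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (bad_thing(items[i])) {
			ret = -1;
			break;
		}
	}

	/*
	 * Return ret rather than a hard-coded 0, and do not overwrite it
	 * after the loop, so the -1 set above reaches the caller.
	 */
	return ret;
}
```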
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
runc restore executes criu with --emptyns network and sets
a setup-namespaces script to restore a network namespace.
https://github.com/xemul/criu/issues/314
Looks-good-to: Pavel Emelyanov <xemul@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 2189b9c71d3d ("net: allow to dump and restore more than one network namespace")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Picked from patch "[PATCH RFC] namespaces: use CLONE_VFORK
with CLONE_VM when it is possible" by Andrew Vagin.
Currently the parent touches the child's stack, because at the moment of
the clone() call its stack pointer is above the child's (we allocate char
stack[128] on the parent's stack). This prevents creating
CLONE_VM|CLONE_VFORK processes, because the child uses stack addresses
occupied by the parent.
The patch changes clone_noasan() behaviour and allows doing that
with the same memory consumption. We give the child memory that
is not used by the parent's clone(), so the parent's and the child's
stacks have no intersection.
This allows creating CLONE_VM|CLONE_VFORK processes.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
(Was "user_ns: Block SIGCHLD during namespaces generation")
We don't want an asynchronous signal handler to run during the creation
of namespaces (for example, in create_user_ns_hierarhy()),
as we do wait() synchronously. So we need to block the signal.
Do this once globally.
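Blocking the signal can be sketched with plain POSIX calls. This is a
minimal illustration under assumptions, not the code from this patch:
block_sigchld() and sigchld_is_blocked() are hypothetical helpers, and
the sketch uses sigprocmask(), which is meant for single-threaded
callers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/*
 * Add SIGCHLD to the set of blocked signals. The previous mask is
 * stored in *old (if not NULL) so that it can be restored later.
 * Returns 0 on success and -1 on error.
 */
static int block_sigchld(sigset_t *old)
{
	sigset_t blocked;

	sigemptyset(&blocked);
	sigaddset(&blocked, SIGCHLD);
	return sigprocmask(SIG_BLOCK, &blocked, old);
}

/* Return 1 if SIGCHLD is currently blocked, 0 if not, -1 on error. */
static int sigchld_is_blocked(void)
{
	sigset_t cur;

	/* A NULL "set" argument only queries the current mask. */
	if (sigprocmask(SIG_BLOCK, NULL, &cur) < 0)
		return -1;
	return sigismember(&cur, SIGCHLD);
}
```

While the signal is blocked, exited children can still be collected with
a synchronous wait(); a pending SIGCHLD is delivered only after the mask
is restored.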
v2: Set initial ret = 0
v3: Block signal globally in root_item before its children
are created.
v4: Move block to prepare_namespace()
Suggested-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In the next patches usernsd will need to create a transport
socket in the same net_ns where other tasks create their
TRANSPORT_FD_OFF sockets.
Choose criu net_ns for that: this allows usernsd
not to wait for the creation of other net_ns, i.e.
not to introduce new dependencies between tasks.
In case of (root_ns_mask & CLONE_NEWUSER) != 0
root_item's user_ns does not allow restoring criu net_ns,
so do prepare_net_namespaces() in a sub-process so as not to
lose criu net.
v3: Introduce __prepare_net_namespaces and execute it in cloned task.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Since the net ns is assigned after prepare_fds() and,
in the common case, at the moment of the open_ns_fd() call
the task points to a net ns that differs from its target
net ns, we can't get the ns from the task. So, get it
from fdstore. Also, support userns ns fds.
v2: Add comment
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We have a test case for external veth devices. This test case
checks veth devices which are living in two dumped network
namespaces.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A network device, which is connected to a bridge, is restored
after the bridge. In this case we can set the master attribute and
the device will be connected to the bridge automatically.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When we dump a veth device, the kernel reports where a peer device lives
and we use this information to restore this veth pair.
On restore we set a net ns id for a peer and it is created in the required
netns.
v2: add more comments
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In each network namespace we can set an id for another network namespace
to be able to address it in netlink messages.
For example, we can say that the peer of a veth device has to be created
in a network namespace with a specified id. If we request information about
a veth device, the kernel will report where the peer device lives.
A user is able to set these IDs, so we have to dump and restore them.
v2: add more comments
v3: make a union of nsfd_id and ns_fd, they are not used together
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Each network namespace has a list of IDs for other namespaces,
so if we request information about a veth device, we get an id
for a namespace of a peer device.
These ID-s can be set by users or by kernel when they are required.
CRIU has to restore these ID-s for network namespaces. We have to
remember that one netns can have different id-s in different network
namespaces.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is possible to assign an id to a network namespace, and
this id will be used by the kernel in some netlink messages.
If no id is assigned when the kernel needs it, it will be
automatically assigned by the kernel.
For example, this id is reported for peer veth devices.
v2: add a comment
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This test creates a few processes which live in three network namespaces
and have a few sockets which are created in different network namespaces.
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Each socket belongs to one network namespace and operates
in this network namespace.
socket_diag reports information about sockets from
one network namespace, but it doesn't report sockets which
are not bound or connected anywhere. So we need to have
a way to get the network namespaces of such sockets.
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This ioctl is called for a socket and returns a file descriptor
for the network namespace where the socket was created.
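A minimal sketch of such a call, assuming the ioctl in question is the
Linux SIOCGSKNS request from <linux/sockios.h> (the message itself does
not name it). get_socket_netns_fd() is a hypothetical wrapper, and the
fallback definition is only for builds against older headers.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sockios.h>

#ifndef SIOCGSKNS
#define SIOCGSKNS 0x894C /* assumption: value from newer kernel headers */
#endif

/*
 * Ask the kernel for a file descriptor that refers to the network
 * namespace in which the socket sk was created.
 * Returns the new fd on success, or -1 with errno set (for example
 * when the kernel lacks the ioctl or the caller lacks privileges).
 */
static int get_socket_netns_fd(int sk)
{
	return ioctl(sk, SIOCGSKNS);
}
```

The caller owns the returned descriptor and should close() it when done;
it can be passed to setns() to enter that namespace.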
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Each socket has to be restored in the proper network namespace,
i.e. the one where it was created.
We set a specified network namespace before restoring a socket.
A task network namespace is set after restoring all files.
v2: don't set the root netns for transport sockets
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We need this to avoid conflicts with file descriptors
which have to be restored.
Currently open_proc_pid() isn't used while restoring
file descriptors, but we are going to use it to restore
sockets in the proper network namespaces.
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Restore all network namespaces from the root task and then set
a proper namespace for each task after restoring sockets, because
we need to switch network namespaces to restore sockets.
Each socket has to be created in a proper network namespace.
v2: fix a typo bug
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Each socket has to be restored in the proper namespace, i.e.
the one where it was created.
There is an issue with unconnected and unbound sockets:
they are not reported via socket-diag, so we can't
get their network namespaces that way.
v2: add a comment before get_socket_ns()
remove nsid from sk_packet_entry
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
PID usually means process ID, but prepare_net_ns works with namespaces.
travis-ci: success for Dump and restore nested network namespaces (rev4)
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
ns_id will be used to collect sockets and other per-netns
resources
travis-ci: success for Dump and restore nested network namespaces (rev4)
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>