Place child reapers of pid namespaces at the beginning
of pstree_item::children list and sort them by nesting
level.
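The ordering can be sketched roughly as below. This is only an
illustration: struct child, its fields and order_children() are
hypothetical stand-ins and not the CRIU pstree_item API, and the
ascending sort direction is an assumption, as the message does not
state it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pstree_item; not the CRIU structure. */
struct child {
	int pid;
	bool is_reaper; /* child reaper (init) of some pid namespace */
	int ns_level;   /* nesting level of that namespace */
	size_t pos;     /* original position, keeps the order stable */
};

static int cmp_children(const void *a, const void *b)
{
	const struct child *x = a, *y = b;

	if (x->is_reaper != y->is_reaper)
		return x->is_reaper ? -1 : 1; /* reapers go first */
	if (x->is_reaper && x->ns_level != y->ns_level)
		return x->ns_level < y->ns_level ? -1 : 1; /* assumed: ascending */
	/* Everything else keeps its original relative order. */
	return (x->pos > y->pos) - (x->pos < y->pos);
}

static void order_children(struct child *c, size_t n)
{
	for (size_t i = 0; i < n; i++)
		c[i].pos = i;
	qsort(c, n, sizeof(*c), cmp_children);
}
```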
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently, a test can depend on only one feature. Allow
a test to depend on several features.
v2: Delete excess "if" as suggested by Andrey Vagin.
Rename variables to decrease the patch size.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Glibc has a bug in process creation:
https://sourceware.org/bugzilla/show_bug.cgi?id=21386
It misbehaves when the parent and the child are in
different pid namespaces and have the same pid.
Use a raw syscall without glibc's asserts as a workaround.
Also, use a raw syscall for getpid() in tests,
as these two functions go in a pair (glibc's getpid()
relies on glibc's fork()).
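A minimal sketch of such raw wrappers is shown below. The names
raw_fork() and raw_getpid() are made up for illustration and are not
necessarily the helpers used in the patch. The argument order passed to
SYS_clone assumes an architecture where flags come first (e.g. x86-64
or aarch64).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask the kernel directly, bypassing any value cached by glibc. */
static pid_t raw_getpid(void)
{
	return (pid_t)syscall(SYS_getpid);
}

/*
 * fork()-like process creation without glibc's fork() wrapper:
 * clone with no flags except the SIGCHLD exit signal and no new stack.
 * Assumption: flags is the first syscall argument on this architecture.
 */
static pid_t raw_fork(void)
{
	return (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
}
```

Note that atfork handlers are not run with such a wrapper, so the child
should restrict itself to simple calls.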
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When a new rb_root is created for a pidns, it is initialized with
RB_ROOT, so ns->pid.rb_root.rb_node is NULL at first. Later,
when lookup_create_pid() inserts the first node into this rb-tree,
the node gets (NULL & color) in node->rb_parent_color.
So the check "!rb_parent(&found->ns[i].node)" is true for
the rb-tree's root node, and criu fails to look up this node.
We haven't hit that yet, because reaching this check requires a task
in at least two levels of pidns that is also the root of the
rb-tree on, e.g., level 0.
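The problem can be reproduced with a stripped-down rb-tree node. The
sketch below only mirrors the rb_parent_color encoding.
rb_link_first() and rb_node_in_tree() are hypothetical helpers written
for this illustration, and the check in rb_node_in_tree() is one
possible way to tell a root node from an unlinked one, not necessarily
what the patch does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rb_node {
	uintptr_t rb_parent_color; /* parent pointer, color in the low bits */
	struct rb_node *rb_left, *rb_right;
};

struct rb_root {
	struct rb_node *rb_node;
};

#define RB_BLACK 1

static struct rb_node *rb_parent(const struct rb_node *n)
{
	return (struct rb_node *)(n->rb_parent_color & ~(uintptr_t)3);
}

/* Hypothetical helper: link n as the first (root) node of an empty tree. */
static void rb_link_first(struct rb_root *root, struct rb_node *n)
{
	n->rb_parent_color = (uintptr_t)NULL | RB_BLACK; /* no parent */
	n->rb_left = n->rb_right = NULL;
	root->rb_node = n;
}

/*
 * Hypothetical check: a node without a parent is still in the tree
 * if it is the root, so "!rb_parent(n)" alone is not enough.
 */
static bool rb_node_in_tree(const struct rb_root *root, const struct rb_node *n)
{
	return rb_parent(n) != NULL || root->rb_node == n;
}
```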
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This minimizes the chance of hitting a problem where files
used for page transfer try to use the same number
that is reserved for a service fd.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Will need it to lift the limit on file allocation
when reserving service fds and, later, when running parasite code
(which is implemented in the vz7 instance and will soon be
ported to vanilla).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Add a fake fd type for autofs. This allows functions
like find_file_desc() to work as expected, without
having two different file_desc with the same type
and same id.
Also, later, it will allow deleting autofs_create_fle()
and using the generic helper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
1) Use CLONE_VFORK to create the subprocess, as it's safe after the patch
"clone_noasan: Allow to create CLONE_VM|CLONE_VFORK processe".
2) Add more CLONE_XXX flags to speed up the syscall.
3) Do not send SIGCHLD, as the parent sees the child's exit() synchronously anyway.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Picked from the patch "[PATCH RFC] namespaces: use CLONE_VFORK
with CLONE_VM when it is possible" by Andrew Vagin.
Currently the parent touches the child's stack, because at the moment of
the clone() call its stack pointer is above the child's (we allocate
char stack[128] on the parent's stack). This prevents creating
CLONE_VM|CLONE_VFORK processes, because the child uses stack addresses
occupied by the parent.
The patch changes clone_noasan() behaviour and allows doing that
with the same memory consumption. We give the child memory that
is not used by the parent's clone(), so the parent's and the child's
stacks do not intersect.
This allows creating CLONE_VM|CLONE_VFORK processes.
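The requirement that the stacks must not intersect can be illustrated
as follows. Note that this sketch takes a different route than the
patch: it gives the child a separately mmap()ed stack instead of a
region of the parent's own stack. run_vfork_child(), store_answer() and
CHILD_STACK_SIZE are names invented for the example, and a
downward-growing stack is assumed.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Example child: writes through the shared address space, then exits. */
static int store_answer(void *arg)
{
	*(int *)arg = 42;
	return 7; /* becomes the child's exit status */
}

/*
 * Run fn in a CLONE_VM|CLONE_VFORK child whose stack does not overlap
 * the parent's. Returns the child's exit status or -1 on error.
 */
static int run_vfork_child(int (*fn)(void *), void *arg)
{
	int status = 0, ret = -1;
	pid_t pid;
	void *stack;

	stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	/* Assumption: the stack grows down, so pass the top of the area. */
	pid = clone(fn, (char *)stack + CHILD_STACK_SIZE,
		    CLONE_VM | CLONE_VFORK | SIGCHLD, arg);
	if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	/* The child has exited by now, so its stack can be released. */
	munmap(stack, CHILD_STACK_SIZE);
	return ret;
}
```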
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Move switch_ns() down because __pstree_pid_by_virt()
does not need cleanup.
Add more goto labels and restore the ns on failure.
Also delete pr_err(), because the error is already printed
by request_set_next_pid().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Before this patch we used flock to order task creation,
but this approach is inefficient. It took 5 syscalls to synchronize
the creation of a single child:
1) open()
2) flock(LOCK_EX)
3) flock(LOCK_UN)
4) close() in parent
5) close() in child
The patch introduces a more efficient way to synchronize,
which takes only 2 syscalls: we use last_pid_mutex.
v2: Don't use flock() at all
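For comparison with the flock() sequence above, a futex-based lock in
shared memory can look roughly like this. It is a simplified sketch and
not CRIU's mutex implementation: sketch_lock() and sketch_unlock() are
invented names, and the lock word is assumed to live in memory shared
by the synchronizing tasks.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 means unlocked, 1 means locked. */
static void sketch_lock(uint32_t *m)
{
	/* Try to take the lock; sleep in the kernel while it stays taken. */
	while (__atomic_exchange_n(m, 1, __ATOMIC_ACQUIRE) != 0)
		syscall(SYS_futex, m, FUTEX_WAIT, 1, NULL, NULL, 0);
}

static void sketch_unlock(uint32_t *m)
{
	__atomic_store_n(m, 0, __ATOMIC_RELEASE);
	/* Wake one waiter, if any. */
	syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

In the uncontended case sketch_lock() does not enter the kernel at all,
so no file has to be opened, locked and closed per child.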
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Group them into 1) error and 2) parent cases. This reduces the code
and will be used in the next patches.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's impossible to create a task from a pid_ns if its helper
is not created, because we wait in wait_pid_ns_helper_prepared()
for that. So, such a situation here is a bug.
Move the wait and convert it to BUG().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Get pid_ns fd from INIT_PID task of this namespace and
use switch_ns() and restore_ns().
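Outside of CRIU's switch_ns()/restore_ns() helpers, the same idea can
be sketched with plain syscalls. open_pid_ns() and enter_pid_ns() are
hypothetical names used only here. Keep in mind that setns() with
CLONE_NEWPID does not move the caller itself; it only selects the pid
namespace for children created afterwards.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the pid namespace of a task, e.g. of the namespace's init. */
static int open_pid_ns(pid_t pid)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)pid);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Make subsequently created children land in the namespace behind
 * ns_fd. Usually needs CAP_SYS_ADMIN. To switch back, call it again
 * with an fd opened earlier for the caller's own pid namespace.
 */
static int enter_pid_ns(int ns_fd)
{
	return setns(ns_fd, CLONE_NEWPID);
}
```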
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In the next patches, root_item will need to have its real pid set,
to be sure usernsd already sees it.
Also add a comment explaining why the real pid is set in two places.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We never call this function for root_item.
It's for dropping user ns, which may happen
with the rest of tasks only.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
memcpy() is not needed here, as we rewrite all the fields later.
Also, use PID_SIZE() helper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When INIT_PID of a pid_ns exits abnormally, the kernel
kills all processes belonging to the namespace.
So, it's hopeless to wait for the helper's answer to a destroy
request. Use kill() to destroy it instead.
It will be a no-op if the helper is already
killed, and we won't get stuck.
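The "no-op if already killed" behaviour follows from how kill() reports
a missing or already dead task. A small sketch, with destroy_helper()
and destroy_and_reap_helper() being names invented for the example and
the helper assumed to be a child of the caller:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Kill the helper without waiting for any answer from it.
 * A helper that is already gone (ESRCH) is not an error.
 */
static int destroy_helper(pid_t helper_pid)
{
	if (kill(helper_pid, SIGKILL) == 0 || errno == ESRCH)
		return 0;
	return -1;
}

/* Same as above, then reap the helper if it is still our child. */
static int destroy_and_reap_helper(pid_t helper_pid)
{
	if (destroy_helper(helper_pid) < 0)
		return -1;
	if (waitpid(helper_pid, NULL, 0) < 0 && errno != ECHILD)
		return -1;
	return 0;
}
```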
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
After the last patches for net ns the test works again (as the environment changed),
so bring it back.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
pid, net, ipc, uts and mnt ids always exist, and we check
for them when we are reading the ids img (see the previous
patch "pstree: Check for always existing task ids").
Also, pstree_item::ids always exists too (we populate
it even for dead tasks, see read_pstree_image()).
So, delete the excess checks and simplify the code.
Also, in restore_one_alive_task() check for has_user_ns_id
instead of ids, as ids always exists.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Limit the scope of this macro and make its boundaries visible.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
has_pid_ns_id is checked above. It could go together with the
previous patch, but I separated them for easier review.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
All alive tasks must have ids and the fields
implemented before the img format became stable
(see commit 2105e18eee).
Check for them in a single place (in addition to the
check for has_pid_ns_id, which we already have).
This will allow removing the checks for item->ids and
for item->ids->has_xxx_ns_id from the rest of the code
and making it simpler (see the later patch "pstree: Delete checks
of always existing pstree_item::ids on restore").
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This is already checked when we check for parent->ids.
So, delete the excess check.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Add a missing return on the memory allocation failure branch.
Found by Coverity.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Do not use the pid namespace helper when the pid has one level.
If it has one level, the created task is in the root pid ns.
Also, as a parent's level is less than or equal to a child's,
the parent is in the root pid ns too. So, write the next pid directly.
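"Writing the next pid directly" boils down to storing the wanted pid
minus one in ns_last_pid of the current pid namespace. A rough sketch,
with format_last_pid() and set_next_pid() as illustrative names rather
than the CRIU functions. The write needs sufficient privileges and is
racy unless task creation is serialized.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LAST_PID_PATH "/proc/sys/kernel/ns_last_pid"

/* The kernel hands out last_pid + 1, so store next_pid - 1. */
static int format_last_pid(char *buf, size_t len, pid_t next_pid)
{
	return snprintf(buf, len, "%d", (int)next_pid - 1);
}

/* Ask the kernel to give next_pid to the next created task. */
static int set_next_pid(pid_t next_pid)
{
	char buf[32];
	int fd, len, ret = -1;

	len = format_last_pid(buf, sizeof(buf), next_pid);
	fd = open(LAST_PID_PATH, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) == len)
		ret = 0;
	close(fd);
	return ret;
}
```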
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
That is for the if-unset check in dump_task_thread(), which compares virt
with -1. It is OK not to initialize virt if the kernel has NSpid in
/proc/pid/status, as parse_pid_status() will overwrite the zeroes, but on
a VZ7 kernel it will fail:
https://ci.openvz.org/job/CRIU/job/CRIU-virtuozzo/job/criu-dev/2021/
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
======================== Run zdtm/static/clone_fs in h =========================
Start test
./clone_fs --pidfile=clone_fs.pid --outfile=clone_fs.out
Run criu dump
=[log]=> dump/zdtm/static/clone_fs/24/1/dump.log
------------------------ grep Error ------------------------
(00.007511) Dumping general registers for 25 in native mode
(00.007525) Dumping GP/FPU registers for 25
(00.007535) 25 has 0 sched policy
(00.007542) dumping 0 nice for 25
(00.007549) Error (criu/cr-dump.c:863): Parasite and /proc/[pid]/status gave different tids
------------------------ ERROR OVER ------------------------
Run criu restore
=[log]=> dump/zdtm/static/clone_fs/24/1/restore.log
------------------------ grep Error ------------------------
(00.000497) Add user ns 2 pid 24
(00.000500) Add pid ns 1 pid 24
(00.000503) Add ipc ns 4 pid 24
(00.000506) Add uts ns 5 pid 24
(00.000514) Error (criu/pstree.c:501): Can't skip zero pids levels (0) or find {parent,} ns (1)
(00.000520) Error (criu/pstree.c:813): BUG at criu/pstree.c:813
------------------------ ERROR OVER ------------------------
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The root task can live in another netns and it has to be restored
before executing setup-namespaces scripts.
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Looks-good-to: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
runc restore executes criu with --emptyns network and sets
a setup-namespaces script to restore a network namespace.
https://github.com/xemul/criu/issues/314
Looks-good-to: Pavel Emelyanov <xemul@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 2189b9c71d3d ("net: allow to dump and restore more than one network namespace")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Allow nested pid_ns, but turn off restoring of pgid and sid for cases
where there are child pid namespaces. This functionality will be
implemented by Pavel Tikhomirov, who is working on it.
v4: Also make restore_before_setsid() always return false if there are
child pid namespaces
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Make the sanity check work in the case of multi-level pids.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the pid is multi-level, ask helpers to populate the /proc last pids
in their active pid namespaces, so the thread will be created with the
right NStids.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We need a socket to request NStids for a task's threads.
The transport socket will be used for that in the next patches.
So, close it later, after the threads are created.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Request helpers to set ns_last_pid in their active pid_ns.
Of course, optimizations are possible here, but not for now.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Since the child's pid_ns may have a user_ns not equal
to the parent's, and we do not want to lose the parent's
user_ns (as it's not possible to restore it back),
create the child from a sub-process.
v3: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>