Iterate over fake_master_head and add a fake fle of root_item,
which becomes the new master and has permissions to restore file_desc.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On Thu, Jun 15, 2017 at 12:16, Cyrill Gorcunov wrote:
> On Thu, Jun 15, 2017 at 12:10:43PM +0300, Kirill Tkhai wrote:
> > On Wed, Jun 14, 2017 at 23:32, Andrei Vagin wrote:
> > > On Wed, Jun 07, 2017 at 02:28:53PM +0300, Kirill Tkhai wrote:
> > > > 1)Find such fle, and link it at the beginning of list.
> > > > 2)Order by pid, where possible, if it does not contradict (1)
> > >
> > > Why do we need to order by pid?
> >
> > This was there initially, and I kept the logic. As far as I know,
> > it's needed for epoll, to place the master in the parent task.
> >
> > CC: gorcunov@virtuozzo.com
> > Cyrill, could you please say, why we need this, if you remember?
>
> I think it's the same as in the bug we met.
> ---
> commit 2df9c9dc6e0b926aaba00138e3e66295ebea76ce
> Author: Cyrill Gorcunov <gorcunov@virtuozzo.com>
> Date: Mon Apr 3 18:38:55 2017 +0300
>
> vz7: files -- Select proper master fd when collecting fd
>
> When choosing the master file which is going to send the file
> descriptor to the children we must not only look at
> their PIDs but also consider process tree relations. In particular,
> the child of a process might be chosen as a master and
> epoll restore will fail because the target files are simply
> not present in the child tree.
>
> | 31964 31964 31964 epoll
> | 585 31964 31964 epoll
> | 586 31964 31964 epoll
> |...
> | (04.797121) 585: Error (criu/eventpoll.c:180): epoll: Unexpected state for tfd (id 0 fd 8)
>
> That's because the target files belong to 31964 and are not
> present in child 585, but because a PID wrap happened it
> has been chosen as a leader, which is of course wrong.
>
> https://jira.sw.ru/browse/PSBM-63355
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
[PATCH v3 18/30] files: Choose file master with enough permissions
1) Find such a fle, and link it at the beginning of the list.
2) Order by pid, where possible, if it does not contradict (1).
3) If there is no master, leave fdesc in fake_master_head.
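The three rules above can be sketched on a plain array. This is only an illustration under assumptions: struct fle_sketch, its can_restore flag and choose_master() are hypothetical names, not CRIU's list-based implementation, and "enough permissions" is reduced to a boolean.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for a file list entry (fle). */
struct fle_sketch {
	pid_t pid;		/* owner task */
	int can_restore;	/* non-zero if the owner has enough permissions */
};

/*
 * Reorder fles[0..n) so that:
 *  (1) a task with enough permissions comes first (the master);
 *  (2) all other entries stay ordered by pid.
 * Returns 0 on success, -1 if no entry may become the master
 * (the caller would then leave the descriptor in fake_master_head).
 */
static int choose_master(struct fle_sketch *fles, size_t n)
{
	size_t i, j, master = n;

	assert(fles != NULL || n == 0);

	/* Insertion sort by pid: such lists are short. */
	for (i = 1; i < n; i++) {
		struct fle_sketch tmp = fles[i];

		for (j = i; j > 0 && fles[j - 1].pid > tmp.pid; j--)
			fles[j] = fles[j - 1];
		fles[j] = tmp;
	}

	/* Pick the lowest-pid entry that is allowed to restore the file. */
	for (i = 0; i < n; i++) {
		if (fles[i].can_restore) {
			master = i;
			break;
		}
	}
	if (master == n)
		return -1;

	/* Rotate the master to the head, keeping pid order for the rest. */
	struct fle_sketch m = fles[master];

	for (i = master; i > 0; i--)
		fles[i] = fles[i - 1];
	fles[0] = m;
	return 0;
}
```

With entries (30, allowed), (10, not allowed), (20, allowed) the result is 20, 10, 30: the master is the lowest pid that satisfies (1), and the rest follow (2).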
v3: Describe pid order reasons
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Returns the user_ns of a file (currently it's not exported to userspace)
and the minimal user_ns needed to restore the file (for example, the socket's
net_ns->user_ns, which regulates setns() permissions).
This will be needed to choose the correct process as the owner of the file master.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The aim is to have top_user_ns set even if !(root_ns_mask & CLONE_NEWUSER).
This avoids additional comparisons of top_user_ns with NULL elsewhere.
Thus, move the fixup for old images to generic code, to support the case above.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
I'm going to use this in the !(root_ns_mask & CLONE_NEWUSER) case,
so choose a better name to fit everything.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It will be needed for quickly obtaining root_item's net_ns,
and to fix up old dumps.
v2: Add a comment to top_xxx_ns. Extend MARK_ROOT_NS().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Just to not allocate path buffer twice.
v2: Change debug message.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Plain wait() waits only for children created with the SIGCHLD flag.
Add it.
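A minimal sketch of the behaviour described above, assuming Linux on an architecture where the first argument of the raw clone system call is the flags word (x86-64, arm64). plain_wait_after_clone() is a hypothetical helper for illustration, not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a child whose exit signal is exit_signal and try a plain
 * waitpid() on it. Returns 0 if the plain wait reaped the child, or the
 * errno value it failed with. The child is always reaped before return.
 */
static int plain_wait_after_clone(int exit_signal)
{
	int status, err = 0;
	long pid;

	assert(exit_signal >= 0);

	/* Raw clone with only an exit signal in flags behaves like fork(). */
	pid = syscall(SYS_clone, (unsigned long)exit_signal, 0UL, 0UL, 0UL, 0UL);
	if (pid < 0)
		return errno;
	if (pid == 0)
		_exit(0);	/* child: exit immediately */

	if (waitpid((pid_t)pid, &status, 0) < 0) {
		err = errno;
		/* __WALL waits for children regardless of their exit signal. */
		waitpid((pid_t)pid, &status, __WALL);
	}
	return err;
}
```

Calling it with SIGCHLD lets the plain wait succeed, while an exit signal of 0 makes the plain wait fail with ECHILD, so such a child has to be collected with __WALL or __WCLONE.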
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Wait for the child before daemonization, so that
zdtm.py cannot see the child's fds and maps before it
becomes a zombie.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Wait for the child before daemonization, so that
zdtm.py cannot see the child's fds and maps before it
becomes a zombie.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The original idea was to sort children and to keep child
reapers at the beginning of the list. But a mistake was
made there: we must look at last_level_pid(), as it is
the indicator of a child_reaper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Parasite returns the last level pid (the pid in the task's pid namespace),
so we mustn't overwrite the vpid already collected from
/proc/[pid]/status.
We handle that correctly on dump, do the same on pre-dump.
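For reference, the per-namespace pids come from the NSpid line of /proc/[pid]/status, which lists the pid in each nested pid namespace, outermost first. The sketch below parses such a line; parse_nspid() and last_level_pid_of() are hypothetical helpers, not CRIU functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Parse one "NSpid:" line of /proc/[pid]/status into pids[0..max).
 * pids[0] is the pid in the outermost visible pidns, the last entry is
 * the pid in the task's own pidns.
 * Returns the number of levels, or -1 if this is not an NSpid line.
 */
static int parse_nspid(const char *line, pid_t *pids, int max)
{
	const char *p;
	char *end;
	int n = 0;

	assert(line != NULL && pids != NULL);
	if (strncmp(line, "NSpid:", 6) != 0)
		return -1;

	p = line + 6;
	while (n < max) {
		long v = strtol(p, &end, 10);

		if (end == p)	/* no more numbers on the line */
			break;
		pids[n++] = (pid_t)v;
		p = end;
	}
	return n;
}

/* Pid in the task's own (innermost) pid namespace, or -1 if unknown. */
static pid_t last_level_pid_of(const pid_t *pids, int levels)
{
	return levels > 0 ? pids[levels - 1] : -1;
}
```

A value reported from inside the task corresponds only to the last entry, which is why it must not replace the values of the outer levels.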
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
The leader of session 15 (20) is in the first pidns, one of its processes
is in the second pidns and one is in the third. So we create two helpers
here, one for each additional pidns.
(It is critical that
The full test now looks like this (mind that the pids here are real and
differ from their ids in the source code, e.g. 15 is 20 here):
(pid,ppid,sid)
session04(1, 0, 1)───session04(4, 1, 4)───session04(5, 4, 4)───session04(6, 5, 6,pid1)─┬─session04(8, 6, 8)───session04(9, 8, 7)
                                                                                       ├─session04(10, 6, 6)───session04(11, 10, 11)
                                                                                       ├─session04(13, 6, 13)───session04(14, 13, 11)
                                                                                       ├─session04(15, 6, 15)
                                                                                       ├─session04(17, 6, 17)─┬─session04(18, 17, 15)
                                                                                       │                      └─session04(19, 17, 17,pid2)───session04(22, 19, 20)
                                                                                       ├─session04(20, 6, 20)
                                                                                       └─session04(23, 6, 6,pid3)───session04(25, 23, 20)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Require the ns_pid, ns_get_userns and ns_get_parent features, otherwise we will
get a "Can't do ns ioctl" error in criu:set_ns_opt().
v2: remove unused variable i in cleanup
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Before "pstree: rework init reparent handling for pid namespaces" patch
we would get:
$ ./test/zdtm.py run -t zdtm/static/session01
=== Run 1/1 ================ zdtm/static/session01
======================= Run zdtm/static/session01 in ns ========================
Start test
./session01 --pidfile=session01.pid --outfile=session01.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/session01/31/1/restore.log
------------------------ grep Error ------------------------
(00.001103) 8 was born with sid 4
(00.001105) 7 was born with sid 4
(00.001106) 21 was born with sid 17
(00.001108) 1 was born with sid 17
(00.001109) Error (criu/pstree.c:1005): Can't find a session leader for 17
------------------------ ERROR OVER ------------------------
Corresponding tree before dump:
(combined 'pstree -pS 1' and 'ps axf -o pid,ppid,sid')
session01(1, 0, 1)─┬─session01(3, 1, 1)───session01(4, 3, 4)─┬─session01(5, 4, 5)─┬─session01(23, 5, 5)
                   │                                         │                    ├─session01(24, 5, 5)
                   │                                         │                    └─session01(26, 5, 5)
                   │                                         ├─session01(6, 4, 4)
                   │                                         ├─session01(7, 4, 7)───session01(16, 7, 4)
                   │                                         └─session01(8, 4, 8)───session01(15, 8, 15)───session01(20, 15, 4)
                   ├─session01(12, 1, 12)───session01(17, 12, 17)───session01(18, 17, 18)───session01(27, 18, 4)
                   ├─session01(13, 1, 10)
                   ├─session01(14, 1, 4)
                   └─session01(21, 1, 21)───session01(22, 21, 17)
22 can not be restored as it needs session 17, but the leader of session 17 is not
among its ancestors (21 had been reparented from 17; 12, 13 and 14 from 4).
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
These checks skip adding helpers and setting ids in case
of nested pid namespaces.
FIXME disable pgid, as it does not work yet
v2: add a comment near the added check for pgid
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
- Put the code into the new handle_init_reparent, make it pidns relative
and call it for each pidns.
- Consider the case when a process tree branch (subtree) is reparented to init
(the parent of the root of this branch died), ripping some session in two
pieces, so that the representative of this session in the reparented branch can
not inherit its session if we simply try to fork the tree as is.
The patch adds the helper can_inherit_sid to find such "adopted" branches and
re-reparent them to helpers.
Previously we handled only direct children of init.
- We need many helpers for one session because:
1) The leader of a session, if it is already dead, can not be recreated as
a helper in an arbitrary pidns, but only in a pidns that is an ancestor of the
pidns of any alive process of this session (session processes can't leave
the pidns in which the session had been created).
Moreover, the session can be created only on the proper level: the sid array of
an alive process can end with several zeroes, meaning that after creation
of the session, processes had entered several more pidnses, so we need to
cut these extra levels before creating the leader.
2) We can not re-reparent a branch directly to the session leader as the latter
can be in another pidns, thus we create an additional helper in our init's pidns,
and its children will reparent to init.
If parents of session processes are in multiple pidnses we will need
a helper per each such pidns, to be able to re-reparent them. See the test
with setns for an example.
- Collect all helper processes in a separate list, so that it is
easier to find them with get_helper_by_sid for other possibly
existing pieces of this sid. Branches re-reparented to such helpers
are temporarily out of the tree and are also skipped by the walk over items
in for_each_pssubtree_item.
- Collect zombies and helpers which will reparent to the init of a pidns in
collect_child_pids to the init of the pidns instead of the root task.
- A process tree which had only reparents to the pidns init process
(no child subreaper reparents) will be restored fine. One tricky case
will fail: when we need to re-reparent, the session leader is in the same pidns
as us and our parent is in a lower pidns. It happens when somebody
enters the pidns, does setsid and then does clone(CLONE_PARENT).
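The idea behind can_inherit_sid can be sketched in a single pid namespace. This is a simplified model under assumptions, not the patch's code: struct task_sketch and can_inherit_sid_sketch() are hypothetical, and the model only asks whether a task leads its session or has an ancestor that is a member of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, flattened process-tree node (one pidns for brevity). */
struct task_sketch {
	int pid, ppid, sid;
};

static const struct task_sketch *find_task(const struct task_sketch *t,
					    size_t n, int pid)
{
	for (size_t i = 0; i < n; i++)
		if (t[i].pid == pid)
			return &t[i];
	return NULL;
}

/*
 * In this simplified model a task can get its session by plain forking
 * if it leads the session itself or some ancestor is a member of it.
 * Returns 1 if so, 0 if the branch needs a helper.
 */
static int can_inherit_sid_sketch(const struct task_sketch *t, size_t n, int pid)
{
	const struct task_sketch *item = find_task(t, n, pid);
	const struct task_sketch *p;

	assert(item != NULL);
	if (item->pid == item->sid)
		return 1;	/* a leader creates its session with setsid() */

	/* Walk up the parents; the walk ends at init, whose ppid is 0. */
	for (p = find_task(t, n, item->ppid); p; p = find_task(t, n, p->ppid))
		if (p->sid == item->sid)
			return 1;
	return 0;
}
```

Fed with a tree where task 22 (parent 21, sid 17) has only ancestors 21 (sid 21) and 1 (sid 1), the check returns 0, so that branch has to be re-reparented to a helper belonging to session 17.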
v2: handle get_free_pids returns 0 as error
v3: rebase due to patchwork fail - use add_child_task and move_child_task
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If a process belonging to some session is in a different pidns than the leader
of this session, it will have zeroes on all additional levels in its sid,
so though the levels of this process and the leader do not match, the sids do.
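A possible shape for such a comparison, assuming a sid is stored as one value per pidns level. struct ns_ids_sketch, MAX_NS_NESTING_SKETCH and equal_sid_sketch() are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NS_NESTING_SKETCH 8	/* hypothetical bound on pidns depth */

/* Hypothetical per-level id: ids[i] is the value seen from pidns level i. */
struct ns_ids_sketch {
	int level;			/* number of valid levels */
	int ids[MAX_NS_NESTING_SKETCH];
};

/*
 * Compare two sids level by level. Levels that exist in only one of the
 * two must be zero there: the process entered deeper pid namespaces
 * after the session was created. Returns 1 if the sids match.
 */
static inline int equal_sid_sketch(const struct ns_ids_sketch *a,
				   const struct ns_ids_sketch *b)
{
	int i, common;

	assert(a != NULL && b != NULL);
	common = a->level < b->level ? a->level : b->level;

	for (i = 0; i < common; i++)
		if (a->ids[i] != b->ids[i])
			return 0;
	for (i = common; i < a->level; i++)
		if (a->ids[i] != 0)
			return 0;
	for (i = common; i < b->level; i++)
		if (b->ids[i] != 0)
			return 0;
	return 1;
}
```

So a leader with sid {20, 4} and a member with sid {20, 4, 0} compare as equal even though their level counts differ.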
v2: change to static inline function as there is no more pr_err
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Needed to look up adopted children of the pidns init. Also add
the skip_descendants flag to be able to skip unneeded subtrees.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
v2: handle get_free_pids returns 0 as error, remove unneeded iter var in
get_free_pids
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Create a child in a new pid_ns; then the child creates a thread and a zombie.
The zombie is in the second newly created pid_ns. Then the great parent
does setns() to its active pid_ns. So, let's draw the table:
pid_ns vs pid_for_children_ns
great parent: equal
child: not equal
child thread: equal
grand child zombie: zombies don't have pid_for_children_ns
After the signal, check that everything remains the same.
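One way to observe the "equal / not equal" column from userspace is to compare the two namespace links of a task. The sketch assumes a kernel that exposes /proc/[pid]/ns/pid_for_children; pid_ns_matches_children_ns() is a hypothetical helper and not part of the test.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the task's pid ns and its pid_for_children ns are the same
 * namespace, 0 if they differ, -1 if either link can't be examined
 * (for example a zombie, or a kernel without pid_for_children).
 */
static int pid_ns_matches_children_ns(pid_t pid)
{
	char path[64];
	struct stat a, b;

	assert(pid > 0);

	snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)pid);
	if (stat(path, &a) < 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/%d/ns/pid_for_children", (int)pid);
	if (stat(path, &b) < 0)
		return -1;

	/* Links to the same namespace share device and inode numbers. */
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For a task that has not called unshare(CLONE_NEWPID) or setns() on a pid namespace, the two links refer to the same namespace.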
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Threads may have different pid_for_children ns.
Allow them to set it after they are created:
just get a pid_ns fd from fdstore, and setns()
to it, after thread creation.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Restore it depending on the number of threads:
1) single-threaded -- before user_ns assignment;
2) multi-threaded -- after thread creation (in the next patch).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In the next patches set_pid_for_children_ns() will be used
without a pid, so move the pid check out of the function.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
No functional changes -- just to improve readability.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It may differ from the group leader's pid_for_children_ns,
so dump it separately.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Handle it like other namespaces, except for the case of a
just unshared pid ns w/o a child reaper (fail if so),
and the case when we don't have it on restore (old
images -- then get the pid_for_children ns from the pid ns).
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Account for the fact that a ns may be found by an alternative
name, and use it if so.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
pid_for_children_ns of root_item may differ from its pid_ns.
In this case we don't want to mark such a pid_for_children_ns
as NS_ROOT in nsid_add().
Also, we don't want to create a fake pid ns if pid_for_children_ns
is a just created pid_ns without a child reaper (the kernel returns ENOENT
in this case). It is better to add kernel support for picking
such namespaces later. So, we return UINT_MAX and fail.
We encode both possibilities using the single "alternative"
parameter, as it's enough for us. If anybody needs to separate
them in the future and introduce separate parameters, it would be rather
simple.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It may be read from "/proc/[pid]/ns/pid_for_children",
so add this alias to pid_ns_desc.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This is needed for "/proc/[pid]/ns/pid_for_children",
which is a pid namespace with a non-standard file name.
Also pass the "alternative" parameter to generate_ns_id().
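A small sketch of how a descriptor with an optional alias could be turned into a /proc path. struct ns_desc_sketch and ns_path() are hypothetical and much smaller than the real descriptor; the alias shown is the kernel's pid_for_children file name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, reduced namespace descriptor with an optional alias. */
struct ns_desc_sketch {
	const char *str;	/* regular file name under /proc/[pid]/ns/ */
	const char *alt_str;	/* alternative file name, or NULL */
};

/*
 * Build the /proc path of a namespace link into buf.
 * Returns the length written, or -1 if there is no alternative name
 * or the buffer is too small for the whole path.
 */
static int ns_path(char *buf, size_t size, pid_t pid,
		   const struct ns_desc_sketch *nd, int alternative)
{
	const char *name;
	int len;

	assert(buf != NULL && nd != NULL);
	name = alternative ? nd->alt_str : nd->str;
	if (!name)
		return -1;

	len = snprintf(buf, size, "/proc/%d/ns/%s", (int)pid, name);
	if (len < 0 || (size_t)len >= size)
		return -1;	/* truncated: the caller needs a larger buffer */
	return len;
}
```

The truncation check matters here because the alternative name is considerably longer than "pid", so a buffer sized for the regular name may not fit it.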
v2: Enlarge buffer size
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently it's set in generate_ns_id() and it's equal to:
1) NS_CRIU, if root_ns_mask does not contain CLONE_NEWPID,
2) NS_ROOT, if it does.
The assignment is not obvious, as it's first set to NS_CRIU
and then overwritten with NS_ROOT, if the latter exists.
Mark top_pid_ns after ns_ids population in a clearer way.
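The rule above, applied once after all ids are known, might look as follows. The enum values, struct ns_id_sketch and find_top_pid_ns() are hypothetical stand-ins for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/sched.h>	/* CLONE_NEWPID */

/* Hypothetical stand-ins for the namespace types. */
enum ns_type_sketch {
	NS_CRIU_SKETCH,
	NS_ROOT_SKETCH,
	NS_OTHER_SKETCH,
};

struct ns_id_sketch {
	unsigned int id;
	enum ns_type_sketch type;
};

/*
 * After all pid ns ids are collected, pick the top one in a single pass:
 * the NS_ROOT entry if the dumped tree has its own pid ns, otherwise the
 * NS_CRIU one. Returns NULL if no such entry was collected.
 */
static const struct ns_id_sketch *
find_top_pid_ns(const struct ns_id_sketch *ids, size_t n,
		unsigned long root_ns_mask)
{
	enum ns_type_sketch want;
	size_t i;

	assert(ids != NULL || n == 0);
	want = (root_ns_mask & CLONE_NEWPID) ? NS_ROOT_SKETCH : NS_CRIU_SKETCH;

	for (i = 0; i < n; i++)
		if (ids[i].type == want)
			return &ids[i];
	return NULL;
}
```

Compared with assigning NS_CRIU first and overwriting it later, the decision is made in one place and depends only on root_ns_mask.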
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Don't call destroy_pid_ns_helpers() twice; add a new case
for that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The parent waits for them by pid in its active pid namespace. So,
the pids need to be converted to that namespace instead of using vpid().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The child reaper of a ns has an initial user_ns equal to its pid_ns->user_ns.
Keep all ns assignments together.
v2: Delete the assignment from call_clone_fn() and rename patch.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
1) Create a pid namespace and a child reaper in it;
2) Set a specific next pid for the next created process;
3) Create one more process in the namespace and kill it;
4) Wait for a signal;
5) Check that the NSpids of the dead task remain the same.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Place child reapers of pid namespaces at the beginning
of pstree_item::children list and sort them by nesting
level.
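The ordering can be sketched with qsort() on an array. Assumptions: struct child_sketch is a hypothetical reduction of a tree item, a child reaper is recognised by its last level pid being 1, and "sort by nesting level" is taken to mean ascending order, which the text does not state.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

#define INIT_PID_SKETCH 1	/* pid of a child reaper inside its own pidns */

/* Hypothetical reduction of a process tree item. */
struct child_sketch {
	int level;		/* pidns nesting depth of the task */
	pid_t last_level_pid;	/* pid in the task's own pidns */
};

static int is_child_reaper(const struct child_sketch *c)
{
	return c->last_level_pid == INIT_PID_SKETCH;
}

/* Reapers first, ordered by nesting level; other children after them. */
static int cmp_children(const void *pa, const void *pb)
{
	const struct child_sketch *a = pa, *b = pb;
	int ra = is_child_reaper(a), rb = is_child_reaper(b);

	if (ra != rb)
		return rb - ra;		/* the reaper sorts first */
	if (ra && a->level != b->level)
		return a->level < b->level ? -1 : 1;
	return 0;
}

static void sort_children(struct child_sketch *c, size_t n)
{
	assert(c != NULL || n == 0);
	/* qsort() is not stable: the order among non-reapers is unspecified. */
	qsort(c, n, sizeof(*c), cmp_children);
}
```

A linked-list implementation would more likely move reapers to the head in place; the array form is used here only to keep the example short.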
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently, only one feature is supported. Add the possibility
for a test to depend on several features.
v2: Delete excess "if" as suggested by Andrey Vagin.
Rename variables to decrease patch size.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>