We have collected a good set of calls that cannot be done inside
user namespaces, but that we need [1]. Some of them have already
been addressed, like the prctl mm bits restore, but some have not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to an unprivileged user), so
we need some way to call those things even in user namespaces.
The good news is that all the calls I've found operate on file
descriptors one way or another. So if we had a process that
lived outside of the user namespace, we could ask it to do the
privileged operation we need and exchange the affected file
descriptor via a unix socket.
So usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via a unix socket.
Clients (the processes we restore) send it the function they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
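The descriptor exchange itself can be sketched with SCM_RIGHTS ancillary
data over a unix socket. This is only an illustration, not the usernsd
code: send_fd() and recv_fd() are hypothetical helper names, and the
request format (function to call, arguments blob) is left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for exactly one descriptor. */
union fd_ctl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one payload byte plus the descriptor fd over the unix socket sk. */
int send_fd(int sk, int fd)
{
	char data = 'F';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sk, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(); returns the new fd or -1. */
int recv_fd(int sk)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sk, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file as the one sent,
so the side that performed the privileged call can hand the result back.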
In the non-user-namespace case the daemon is not started and the
calls are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the privileged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
For that we need to save per-namespace mappings of user and group IDs.
And all id-s for tasks and files are saved from the target user
namespace.
v2: move code into collect_namespaces()
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support user namespaces and uid-s will be converted
according to the userns mappings.
v2: convert id-s for sockets too
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is for two reasons. First, validation can meet an external mount
and will call plugins, which is not correct on pre-dump and actually
crashes on the uninitialized plugin lists. Second, even if the mount
tree is not "supported" on pre-dump, this can be a temporary situation
(yes, yes, unlikely, but still).
On the other hand, it's better to fail earlier, but that's another
story.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
On pre-dump we collect only two namespaces -- the mnt one
for criu and mnt one again for root task.
This is not correct. We need all mount namespaces to make
the irmap generation work properly and we need all net
namespaces to have parasite sockets created.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The setns() syscall (called by switch_ns()) can be extremely
slow. If we call it two or more times from the same task, the
kernel will synchronously go into a very slow routine called
synchronize_rcu() while trying to put the reference on the old
namespaces.
To avoid doing this more than once I propose to create all
per-ns sockets in one place with one setns call. In this
patch the nl diag socket used to collect other sockets
is created this way.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We currently have all mountinfo-s from all mnt namespaces collected
in one big list. On dump we scan through it to find the namespaces
we need to dump.
This can be optimized by walking the list of namespaces instead.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We are going to support nested mntns, so the global mntinfo_tree
variable is useless and information about the tree should be connected
to the proper namespace.
But when we don't dump mntns, we need to collect mounts for the current
mntns.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We are going to support nested mount namespaces and each NS has its own
tree. The mount tree is used for checking that a file is reachable.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
because we want to check that all files are reachable.
For that we need to collect all mounts from all namespaces.
v2: dump mntns separately
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now we support sub-mntns, so root_ns_mask sounds more correct than
current_ns_mask.
v2: typo fix
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Known issues:
* currently only namespaces with the same root are supported
* nested namespaces can be dumped and restored only if the root task
has its own mount namespace.
All nested namespaces are restored in the root namespace in temporary
directories. All mount points are restored in one tree and then they
are divided into namespaces.
The task with the minimal pid in each namespace unshares the mntns
and then does pivot_root into the proper temporary directory. All
other tasks call setns to enter the mount namespace of the task with
the minimal pid.
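The "task with minimal pid" rule can be illustrated with a small pure
function. This is a sketch under assumptions: struct task_sketch and
mntns_leader() are hypothetical names, and the real code walks the
process tree rather than a flat array.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-task record: pid and the ID of its mount namespace. */
struct task_sketch {
	pid_t pid;
	unsigned int mnt_ns_id;
};

/*
 * Return the index of the task with the minimal pid among the tasks
 * sharing the mount namespace of tasks[i]. If the result equals i,
 * task i would unshare the mntns and pivot_root; otherwise it would
 * setns into the namespace of the returned task.
 */
size_t mntns_leader(const struct task_sketch *tasks, size_t n, size_t i)
{
	size_t leader = i, j;

	for (j = 0; j < n; j++)
		if (tasks[j].mnt_ns_id == tasks[i].mnt_ns_id &&
		    tasks[j].pid < tasks[leader].pid)
			leader = j;

	return leader;
}
```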
v2: clean up
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently ns_ids list is filled only on dump. Soon we'll need this
list for mount namespaces on restore, e.g. to know which tasks share
the namespaces.
v2: merge the patch "namespace: add a function to search an ns_id
item by id" into this one.
v3: add prefix rst_ to add_ns_id
v4: look up namespace by two values -- type AND ID
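A lookup keyed on both type and ID, as in v4, might look roughly like
the sketch below. The struct layout and the lookup_ns_sketch() name are
assumptions made for illustration; only the idea of matching on the pair
comes from this patch.

```c
#include <stddef.h>

/* Hypothetical list item: one namespace seen in the image. */
struct ns_id_sketch {
	unsigned int id;		/* generated namespace ID */
	unsigned int type;		/* namespace type, e.g. a CLONE_NEW* flag */
	struct ns_id_sketch *next;
};

/* IDs of different namespace types may collide, so match on both. */
struct ns_id_sketch *lookup_ns_sketch(struct ns_id_sketch *head,
				      unsigned int type, unsigned int id)
{
	for (; head; head = head->next)
		if (head->type == type && head->id == id)
			return head;

	return NULL;
}
```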
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's going to be used for restoring namespaces. For example we need to
enumerate the ns_ids list for restoring mount namespaces.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Well, we want to pre-dump files (fsnotifies), for that we
will need mountinfo-s and root, and for the latter -- the
current ns mask.
The problem with current ns mask is that its generation is
incorporated into ns IDs generation and dumping. And since
the ids dumping is not performed on pre-dump, let's just
provide a helper for ns-mask generation.
Strictly speaking, the whole ns-mask idea is not great, but
it's to be fixed later.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We really have a mess of extern/non-extern declaration
of functions in our headers. Always use extern for
unification purpose.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Since *all* of them just call do_dump_gen_file with proper ops,
just call one directly. Compacts the code.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Based on work done by Cyrill Gorcunov (many thanks for that).
In this commit we implement c/r for files which have opened
/proc/$pid/ns/$ids entries.
The idea is rather simple
Checkpoint
==========
- Check if the file name is one of those known to be an ns reference
- If match then write protobuf entry
Restore
=======
- Read all ns entries from the image
- When criu tries to open one, we look through the process
tree to figure out which PID should be used in the path
and then just open it
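The restore-side path lookup can be sketched as follows. Everything
named here is hypothetical (struct proc_sketch, ns_path_for_id()), and
only the net namespace is handled to keep the example short.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical process tree entry: pid and the ID of its net namespace. */
struct proc_sketch {
	int pid;
	unsigned int net_ns_id;
};

/*
 * Find the first task living in the net namespace ns_id and build the
 * /proc/<pid>/ns/net path for it. Returns 0 on success, -1 if no task
 * matches or the buffer is too small.
 */
int ns_path_for_id(const struct proc_sketch *tree, size_t n,
		   unsigned int ns_id, char *path, size_t len)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		if (tree[i].net_ns_id != ns_id)
			continue;

		ret = snprintf(path, len, "/proc/%d/ns/net", tree[i].pid);
		return (ret > 0 && (size_t)ret < len) ? 0 : -1;
	}

	return -1;
}
```

The resulting path would then be passed to open() by the restorer.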
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This will be needed for fast parsing of procfs ns references.
[ xemul: Add user_ns_desc here ]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
I'll need to modify this header, so before anything
else let's beautify it
- drop struct pstree_item declaration it's already in pstree.h
- move struct cr_options to top
- align members of struct ns_desc
- move externs to top
- add argument name to try_show_namespaces
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we've read all pstree-items and their ids we
can get the desired clone-flags early and avoid all
these dances with flag calculations in fork_with_pid
and company.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Introduce the current_ns_mask variable, which collects info about
which namespaces the tasks being dumped or restored live in.
For simplicity all tasks are supposed to live in one set of namespaces.
This should be fixed eventually.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Recent kernels allow getting namespace IDs by reading proc-ns links.
Use this to generate IDs for tasks' namespaces (I do generate them, since
the IDs provided by the kernel look ugly :( ).
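For illustration, a /proc/<pid>/ns/<type> link reads as e.g.
"uts:[4026531838]". The sketch below parses such a string and maps the
kernel value to a small sequential ID. The names (parse_ns_link(),
generate_ns_id(), struct ns_id_map) and the fixed-size table are
assumptions, not the code from this patch.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NS_SKETCH	16

/* Kernel IDs seen so far; the generated ID is the slot index plus one. */
struct ns_id_map {
	unsigned int kids[MAX_NS_SKETCH];
	unsigned int nr;
};

/* Parse a link target such as "uts:[4026531838]"; returns 0 on success. */
int parse_ns_link(const char *link, const char *type, unsigned int *kid)
{
	size_t tlen = strlen(type);
	char close;

	if (strncmp(link, type, tlen) != 0)
		return -1;
	if (sscanf(link + tlen, ":[%u%c", kid, &close) != 2 || close != ']')
		return -1;

	return 0;
}

/* Return the small ID for kid, allocating one on first sight; 0 if full. */
unsigned int generate_ns_id(struct ns_id_map *map, unsigned int kid)
{
	unsigned int i;

	for (i = 0; i < map->nr; i++)
		if (map->kids[i] == kid)
			return i + 1;

	if (map->nr == MAX_NS_SKETCH)
		return 0;

	map->kids[map->nr++] = kid;
	return map->nr;
}
```

The link string itself would come from readlink() on the proc-ns entry.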
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These are structs that (now) tie together ns string
and the CLONE_ flag. It's nice to have one (some code
becomes simpler) and will help us with auto-namespaces
detection.
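A descriptor of this kind could be sketched as below. The struct and
function names are hypothetical, and the fallback values are the Linux
uapi CLONE_NEW* constants, provided in case <sched.h> does not expose
them.

```c
#define _GNU_SOURCE	/* for CLONE_NEW* in <sched.h> */
#include <sched.h>
#include <stddef.h>
#include <string.h>

#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS	0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC	0x08000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET	0x40000000
#endif

/* Ties the namespace name string to its CLONE_ flag. */
struct ns_desc_sketch {
	unsigned int cflag;
	const char *str;
};

static const struct ns_desc_sketch ns_descs[] = {
	{ CLONE_NEWUTS, "uts" },
	{ CLONE_NEWIPC, "ipc" },
	{ CLONE_NEWNET, "net" },
};

/* Return the CLONE_ flag for a namespace name, or 0 if it is unknown. */
unsigned int ns_flag_by_name(const char *str)
{
	size_t i;

	for (i = 0; i < sizeof(ns_descs) / sizeof(ns_descs[0]); i++)
		if (!strcmp(ns_descs[i].str, str))
			return ns_descs[i].cflag;

	return 0;
}
```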
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
I believe it makes sense to keep this structure
in pstree.h where pstree related data lives.
Also I've added some comments on struct pid members.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This will be required for parasite transport socket creation -- it will
have to be created in a net ns we're putting parasite in and then we'll
have to restore it back to original to go on dumping.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It uses pid to create the image file and real_pid for dumping ns-s.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Each fdset item now has a callback which shows the contents of a magic-described
image file. Per-task and global show code is reworked to walk the respective fdsets
and call ->show on each file.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
v2: strlen() check removed from parse_ns_string()
Now the '-n' option must be followed by namespace tags, separated by commas.
Currently, only "uts" namespace is supported.
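Parsing such a comma-separated tag list could look like the sketch
below. It is not the parse_ns_string() from this patch; parse_ns_tags()
is a hypothetical name, and the fallback value is the Linux uapi
CLONE_NEWUTS constant.

```c
#define _GNU_SOURCE	/* for CLONE_NEWUTS in <sched.h> */
#include <sched.h>
#include <string.h>

#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS	0x04000000
#endif

/*
 * Parse a list like "uts" or "uts,uts" into a CLONE_ flags mask.
 * Returns 0 on success, -1 on an unknown or empty tag.
 */
int parse_ns_tags(const char *arg, unsigned int *mask)
{
	*mask = 0;

	for (;;) {
		size_t len = strcspn(arg, ",");

		if (len == 3 && !strncmp(arg, "uts", 3))
			*mask |= CLONE_NEWUTS;
		else
			return -1;

		if (arg[len] == '\0')
			return 0;

		arg += len + 1;	/* skip the tag and the comma */
	}
}
```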
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Generic code will be used for other namespaces.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Only two fields are modifiable -- hostname and domainname. So
read them on dump and write on restore.
File format is simple --
u32 magic
u32 length of nodename
u8[] nodename string
u32 length of domainname
u8[] domainname string
For OpenVZ we can write the release at the end, but this is later.
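The layout above can be sketched as a pair of write/read helpers. The
magic value is a placeholder (the real one is not given here), the
function names are hypothetical, and integers are written in host byte
order, which is an assumption.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UTSNS_MAGIC_SKETCH	0x12345678u	/* placeholder, not the real magic */

/* Write "u32 length" followed by the string bytes (no trailing NUL). */
static int put_str(FILE *f, const char *s)
{
	uint32_t len = (uint32_t)strlen(s);

	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (len && fwrite(s, 1, len, f) != len)
		return -1;
	return 0;
}

/* Read one length-prefixed string into buf and NUL-terminate it. */
static int get_str(FILE *f, char *buf, size_t size)
{
	uint32_t len;

	if (fread(&len, sizeof(len), 1, f) != 1 || len >= size)
		return -1;
	if (len && fread(buf, 1, len, f) != len)
		return -1;
	buf[len] = '\0';
	return 0;
}

int write_utsns(FILE *f, const char *nodename, const char *domainname)
{
	uint32_t magic = UTSNS_MAGIC_SKETCH;

	if (fwrite(&magic, sizeof(magic), 1, f) != 1)
		return -1;
	return (put_str(f, nodename) || put_str(f, domainname)) ? -1 : 0;
}

/* Both output buffers are assumed to be size bytes long. */
int read_utsns(FILE *f, char *nodename, char *domainname, size_t size)
{
	uint32_t magic;

	if (fread(&magic, sizeof(magic), 1, f) != 1 ||
	    magic != UTSNS_MAGIC_SKETCH)
		return -1;
	return (get_str(f, nodename, size) ||
		get_str(f, domainname, size)) ? -1 : 0;
}
```

On restore, the strings read back would presumably be applied with
sethostname() and setdomainname() inside the new UTS namespace.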
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
New option -n to dump/restore namespaces.
Fork the namespaces dumping task and write a helper for switching a namespace.
Prepare the restorer code for restoring namespaces before root task.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>