This one gets the index from the fd/fd_parms pair on dump. For
console and vt the index is constant and just sits on
the tty_type (it will also be used on restore, see the next
patch).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The plan is to replace tons of if (type == TTY_TYPE_FOO) checks
with type->something dereferences.
To do this, start with replacing int type with struct tty_type *
in relevant places and fixing compilation.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Here it is. The major new thing, I think, is the CRIT tool,
which will be the main one to manipulate images and will
eventually replace the "criu show" action.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
and print errno for the wait syscall in an error case
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If we parse /proc/pid/status when a task isn't stopped,
we can't be sure that a process state will not be changed.
08:58:48 Test: zdtm/live/user/static/zombie00, Namespace: 1
08:58:48 Dump log : /var/lib/jenkins/jobs/CRIU-dump/workspace/test/dump/ns/user/static/zombie00/114/1/dump.log
08:58:48 --------------------------------- grep Error ---------------------------------
08:58:48 (00.001127) Error (ptrace.c:124): SEIZE 121: task not stopped after seize
v2: don't rely on errno (by xemul@)
Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
/dev/tty stands for the current terminal, which we don't
support yet.
This is a bugfix for the upcoming stable version; proper
support of /dev/tty will be implemented separately.
Reported-by: Saied Kazemi <saied@google.com>
CC: Andrew Vagin <avagin@parallels.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A test is executed by different users in the cases with and without
userns, so it can't open files which were created before.
Here is an example for ns/user/static/inotify_irmap:
13355 mkdir("/etc", 0600) = -1 EEXIST (File exists)
13355 unlink("/etc/zdtm-test") = -1 EACCES (Permission denied)
13355 creat("/etc/zdtm-test", 0600) = -1 EACCES (Permission denied)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
One test can't be executed as ns/test and ns/user/test
simultaneously, because they use the same file tree.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For an established TCP connection, the send queue is restored in two
steps: in step (1), we retransmit the data that was sent before but not
yet acknowledged, and in step (2), we transmit the data that was never
sent outside before. The TCP_REPAIR option is disabled before step (2)
and re-enabled after step (2) (without this patch).
If the amount of data to be sent in step (2) is large, the TCP_REPAIR
flag on the socket can remain off for some time (O(milliseconds)). If a
listen() is called on another socket bound to the same port during this
time window, it fails. This is because turning TCP_REPAIR off clears
the SO_REUSEADDR flag on the socket.
This patch adds a mutex (reuseaddr_lock) per port number, so that a
listen() on a port number does not happen while SO_REUSEADDR for another
socket on the same port is off.
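The locking rule above can be sketched as follows. This is only an
illustration, not the patch itself: threads stand in for the tasks
being restored, and the names port_lock, reuseaddr_off and listen_on
are hypothetical. How the real reuseaddr_lock is shared between
restoring tasks is not shown here.

```python
import threading
from contextlib import contextmanager

# Hypothetical registry: one lock per TCP port number.
_registry_guard = threading.Lock()
_port_locks = {}


def port_lock(port):
    """Return the lock for `port`, creating it on first use."""
    with _registry_guard:
        if port not in _port_locks:
            _port_locks[port] = threading.Lock()
        return _port_locks[port]


@contextmanager
def reuseaddr_off(port):
    """Hold the port lock for the window in which TCP_REPAIR is off
    and SO_REUSEADDR is therefore cleared on a socket bound to `port`."""
    with port_lock(port):
        yield


def listen_on(port, do_listen):
    """Run `do_listen` (a callable that would call listen()) only while
    no socket on the same port is inside a reuseaddr_off() window."""
    with port_lock(port):
        return do_listen()
```

A restorer would wrap step (2) of the send queue restore in
reuseaddr_off(port), and every listen() for that port in
listen_on(port, ...), so the two can never overlap.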
Thanks to Amey Deshpande <ameyd@google.com> for debugging.
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
"%m" can't be used to print strerror(errno), because test_msg()
calls gettimeofday() which can overwrite errno.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We link files to each other at restore time to restore
unlinked paths. The kernel has strange security restrictions
on the linkat we use. If the fsuid of the caller doesn't
equal the uid of the file and the file is not a "safe"
one, then only global CAP_FOWNER will be allowed to link().
This brings problems in user namespaces -- uns root is
not allowed to linkat any file, unlike global root.
Fortunately, we can change the fsuid temporarily and
still linkat the file we want. Hopefully this hack will
go away some day soon, once the kernel has saner
checks for linkat capabilities.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
The test uses the map_files dir to check that the mapping is restored,
but this proc directory is only available with CAP_SYS_ADMIN.
Fix this by checking the less restricted /proc/pid/maps instead.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The rest partially need more userns_call-s, but mostly they just don't
work in userns themselves. This needs further investigation.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Locked termios require global CAP_SYS_ADMIN. But let's
restore everything for tty in one call, since regular
termios depend on the locked ones and it's not nice to do
a sync usernsd call for the locked ones only.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
The syscall in question requires global CAP_DAC_READ_SEARCH.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
We have collected a good set of calls that cannot be done inside
user namespaces, but that we need to do [1]. Some of them have already
been addressed, like the prctl mm bits restore, but some have not.
I'm pretty sceptical about the ability to relax the security
checks on quite a lot of them (e.g. open-by-handle is indeed a
very dangerous operation if allowed to an unprivileged user), so
we need some way to call those things even in user namespaces.
The good news is that all the calls I've found operate
on file descriptors one way or another. So if we had a process
that lived outside of the user namespace, we could ask it to do the
privileged operation we need and exchange the affected file
descriptor via a unix socket.
So the usernsd is the one doing exactly this. It starts before we
create the user namespace and accepts requests via unix socket.
Clients (the processes we restore) send it the function they
want to call, the descriptor they want to operate on and the
arguments blob. Optionally, they can request some file descriptor
back after the call.
In the non-user-namespace case the daemon is not started and the calls
are done right in the requestor's process environment.
In the next patch there's an example of how to use this daemon
to do the privileged SO_SNDBUFFORCE/_RCVBUFFORCE sockopt on
a socket.
[1] http://criu.org/UserNamespace
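The request flow described above (call name, arguments blob and a
descriptor sent over a unix socket) can be sketched like this. It is a
sketch under assumptions, not the usernsd code: the wire format, the
names send_request, recv_request, serve_one and CALLS, and the harmless
write_blob call are all hypothetical, and both ends run in one process
here just to show the descriptor exchange via SCM_RIGHTS.

```python
import array
import os
import socket
import struct


def send_request(sock, name, args, fd):
    """Client side: send a call name, an argument blob and one descriptor."""
    payload = name.encode() + b"\0" + args
    rights = array.array("i", [fd]).tobytes()
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, rights)])


def recv_request(sock, bufsize=4096):
    """Daemon side: receive one request and the descriptor attached to it."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    name, _sep, args = data.partition(b"\0")
    return name.decode(), args, fds[0]


def write_blob(fd, args):
    """Stand-in for a privileged call: just write the blob to the fd."""
    return os.write(fd, args)


# Hypothetical table of calls the daemon is willing to perform.
CALLS = {"write_blob": write_blob}


def serve_one(sock, calls):
    """Handle a single request and send the integer result back."""
    name, args, fd = recv_request(sock)
    try:
        ret = calls[name](fd, args)
    finally:
        os.close(fd)
    sock.send(struct.pack("i", ret))


def demo():
    """Run client and daemon ends over a socketpair in one process."""
    client, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    r, w = os.pipe()
    try:
        send_request(client, "write_blob", b"hello", w)
        serve_one(daemon, CALLS)
        (ret,) = struct.unpack("i", client.recv(4))
        return os.read(r, 16), ret
    finally:
        os.close(r)
        os.close(w)
        client.close()
        daemon.close()
```

In a real setup the daemon end would live in a process started before
the user namespace is created, so the call runs with the privileges
the restored task lacks. Sending a descriptor back would use the same
SCM_RIGHTS message in the other direction.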
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
User is already able to see it in stdout, so there is no
reason why we should protect it.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If criu is run with the suid bit set, the user should be able
to read pidfiles (i.e. the service pidfile).
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This will allow us to easily extend commands that crit
supports, avoiding "--help" confusion.
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Starting with version 3.15, the kernel provides a mnt_id field in
/proc/<pid>/fdinfo/<fd>. However, the value provided by the kernel for
AUFS file descriptors obtained by opening a file in /proc/<pid>/map_files
is incorrect.
Below is an example for a Docker container running Nginx. The mntid
program below mimics CRIU by opening a file in /proc/1/map_files and
using the descriptor to obtain its mnt_id. As shown below, mnt_id is
set to 22 by the kernel but it does not exist in the mount namespace of
the container. Therefore, CRIU fails with the error:
"Unable to look up the 22 mount"
In the global namespace, 22 is the root of AUFS (/var/lib/docker/aufs).
This patch sets the mnt_id of these AUFS descriptors to -1, mimicking
pre-3.15 kernel behavior.
$ docker ps
CONTAINER ID IMAGE ...
3850a63ee857 nginx-streaming:latest ...
$ docker exec -it 38 bash -i
root@3850a63ee857:/# ps -e
PID TTY TIME CMD
1 ? 00:00:00 nginx
7 ? 00:00:00 nginx
31 ? 00:00:00 bash
46 ? 00:00:00 ps
root@3850a63ee857:/# ./mntid 1
open("/proc/1/map_files/400000-4b8000") = 3
cat /proc/49/fdinfo/3
pos: 0
flags: 0100000
mnt_id: 22
root@3850a63ee857:/# awk '{print $1 " " $2}' /proc/1/mountinfo
87 58
103 87
104 87
105 104
106 104
107 104
108 87
109 87
110 87
111 87
root@3850a63ee857:/# exit
$ grep 22 /proc/self/mountinfo
22 21 8:1 /var/lib/docker/aufs /var/lib/docker/aufs ...
44 22 0:35 / /var/lib/docker/aufs/mnt/<ID> ...
$
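The fdinfo lookup shown in the transcript, together with the -1
fallback, can be sketched as below. The function names are
hypothetical, and the is_aufs_map_file flag is an assumption: how the
patch recognizes an AUFS descriptor is not described here, so the
caller has to supply that decision.

```python
def parse_fdinfo(text):
    """Turn the 'key: value' lines of /proc/<pid>/fdinfo/<fd> into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def effective_mnt_id(fdinfo_text, is_aufs_map_file):
    """Return the mnt_id to record for a descriptor.

    -1 is returned when the kernel gives no mnt_id (pre-3.15) or when
    the caller says the descriptor is an AUFS map_files one, whose
    kernel-provided value cannot be trusted.
    """
    fields = parse_fdinfo(fdinfo_text)
    if "mnt_id" not in fields or is_aufs_map_file:
        return -1
    return int(fields["mnt_id"])
```

With the fdinfo content from the transcript above, a regular descriptor
would get mnt_id 22, while the AUFS map_files one gets -1 and no mount
lookup is attempted for it.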
Signed-off-by: Saied Kazemi <saied@google.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If a process is executed in another pidns, /proc/PID doesn't refer to
the proper process.
This patch fixes a problem like this:
1: Error (util.c:106): Unable to close fd 33: Bad file descriptor
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
There are two places where we store IP addresses (both IPv4 and IPv6).
Mark them with a custom option and print them in compressed form for
--pretty output.
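As an illustration of what "compressed form" means, here is a minimal
sketch using the standard ipaddress module. It assumes the address is
available as raw bytes in network order (4 bytes for IPv4, 16 for
IPv6); how the images actually store it is not stated here, and
format_ip is a hypothetical name.

```python
import ipaddress


def format_ip(raw):
    """Render a packed IPv4 (4 bytes) or IPv6 (16 bytes) address
    in its usual compressed textual form."""
    return str(ipaddress.ip_address(bytes(raw)))
```

For example, format_ip(bytes([127, 0, 0, 1])) gives "127.0.0.1", and
the 16 bytes 2001 0db8 0000 ... 0001 come out as "2001:db8::1".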
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Ruslan Kuprieiev <kupruser@gmail.com>
I plan to mark some fields as IP addresses and print them accordingly.
The --format hex switch is not a nice fit for this, and introducing one
more value (--format hex ipadd) is even worse.
So let's fix the crit API to be simple and stupid. By default crit
generates canonical one-line JSON. With --pretty option it splits the
output into lines, adds indentation and prints hex as hex and IP as
IP.
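The two output modes can be sketched with the standard json module.
This is not crit's code: render and the hex_fields argument are
hypothetical, and the sketch only handles top-level integer fields.

```python
import json


def render(entry, pretty=False, hex_fields=()):
    """Dump one image entry as JSON.

    Default: canonical one-line JSON with values left untouched.
    pretty=True: indented output, with the fields named in
    `hex_fields` printed as hex strings.
    """
    if not pretty:
        return json.dumps(entry)
    shown = dict(entry)
    for name in hex_fields:
        if isinstance(shown.get(name), int):
            shown[name] = hex(shown[name])
    return json.dumps(shown, indent=4)
```

Keeping the default output canonical means it can be fed back to a
tool unchanged, while the human-friendly conversions are confined to
the pretty mode.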
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Ruslan Kuprieiev <kupruser@gmail.com>
Can be useful to re-run some tests in case something failed in the middle.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>