At now we pretend that all threads are sharing seccomp chains
and at checkpoint moment we test seccomp modes to make sure
if this assumption is valid refusing to dump otherwise.
Still the kernel tacks seccomp filter chains per each thread
and now we've faced applications (such as java) where per-thread
chains are actively used. Thus we need to bring support of handling
filters via per-thread basis.
In this a bit intrusive patch the restore engine is lifted up
to treat each thread separately. Here what is done:
- Image core file is modified to keep seccomp filters
inside thread_core_entry. For backward compatibility
former seccomp_mode and seccomp_filter members in
task_core_entry are renamed to have old_ prefix and
on restore we test if we're dealing with old images.
Since per-thread dump is not yet implemeneted the
dumping procedure continue operating with old_ members.
- In pie restorer code memory containing filters are addressed
from inside thread_restore_args structure which now
contains seccomp mode itself and chain attributes
(number of filters and etc).
Reading of per-thread data is done in seccomp_prepare_threads
helper -- we take one pstree_item and walks over every thread
inside to allocate pie memory and pin data there.
Because of PIE specific, before jumping into pie code
we have to relocate this memory into new place and
for this seccomp_rst_reloc is served.
In restorer itself we check if thread_restore_args provides
us enabled seccomp mode (strict or filter passed) and call
for restore_seccomp_filter if needed.
- To unify names we start using seccomp_ prefix for all related
stuff involved into this change (prepare_seccomp_filters renamed
to seccomp_read_image because it only reads image and nothing
more, image handler is renamed to seccomp_img_entry instead
of too short 'se'.
With this change we're now allowed to start collecting and
dumping seccomp filters per each thread, which will be
done in next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Note that there is no real usage of this flag on restore,
we simply save it in image and will make a real use
later.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We want to use a simple fact: If we have an alive process in a pstree we
want to dump, and a starttime of that process is less than pre-dump's
timestamp (taken while all processes were freezed), then these exact
process existed (100% sure) at the time of these pre-dump and the
process' memory was dumped in images.
So save inventory image on pre-dump and put there an uptime.
https://jira.sw.ru/browse/PSBM-67502
v9: improve comment, put uptime to ivnentory image as 1) where is no
stats in parent images directory if --work-dir option is set to
something different then images directory, 2) stats-dump is not an image
and it is a bad practice to put there data required for restoring.
v10:s/u_int64_t/uint64_t/
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently we restore all sockets in the root mount namespace, because we
were not able to get any information about a mount point where a socket
is bound. It is obviously incorrect in some cases.
In 4.10 kernel, we added the SIOCUNIXFILE ioctl for unix sockets. This
ioctl opens a file to which a socket is bound and returns a file
descriptor.
This new ioctl allows us to get mnt_id by reading fdinfo, and mnt_id
is enough to find a proper mount point and a mount namespace.
The logic of this patch is straight forward. On dump, we save mnt_id for
sockets, on restore we find a mount namespace by mnt_id and restore this
socket in its mount namespace.
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This adds new tunfile_entry::ns_id field and populates
it in standard socket way. Restore uses this ns_id
to choose correct namespace. Note, we could completelly
skip set_netns() on restore in case of !has_ns_id, but
using top_net_ns invents some definite behaviour.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
ktkhai: comment written/code movings
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
There are two problems. The first is CTL_TTY_OFF occupies
one of the biggest available fds in the system. It's a number
near service_fd_rlim_cur. Next patches want to allocate
service fds lower, than service_fd_rlim_cur, and they want
to know max used fd from file fles after the image reading.
But since one of fds is already set very big (CTL_TTY_OFF)
on a stage of collection fles, the only availabe service
fds are near service_fd_rlim_cur. It's vicious circle,
and the only way is to change ctl tty fd allocation way.
The second problem is ctl tty is ugly out of generic file
engine fixup (see open_fd()). This is made because ctl tty
is the only slave fle, which needs additional actions
(see tty_restore_ctl_terminal()). Another file types just
receive their slave fle, and do not do anything else.
This patch moves ctl tty to generic engine and solves all
the above problems. To do that, we implement new CTL_TTY
file type, which open method waits till slave tty is received
and then calls tty_restore_ctl_terminal() for that. It fits
to generic engine well, and allocates fd via find_unused_fd(),
and do not polute file table by big fd numbers.
Next patch will kill currently unneed CTL_TTY leftovers
and will remove CTL_TTY_OFF service fd from criu.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Add a fake fd type for autofs. This allows functions
like find_file_desc() work as expected, without
having two different file_desc with the same type
and same id.
Also, later, it will allow to delete autofs_create_fle()
and to use generic helper.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A network device, which is connected to a bridge, is restored
after the bridge. In this case we can set the master attribute and
the device will be connected to the bridge automatically.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is possible to assign id for network namespaces and
this id will be used by the kernel in some netlink messages.
If no id is assigned when the kernel needs it, it will be
automatically assigned by the kernel.
For example, this id is reported for peer veth devices.
v2: add a comment
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Each socket has to be restored in a proper namespaces where
it has been created.
Here is an issue about unconnected and unbound sockets,
they are not reported via socket-diag and we can't to
get their network namespaces.
v2: add a comment before get_socket_ns()
remove nsid from sk_packet_entry
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In this case we can wait it and get an exit code.
For example, it will be useful for p.haul where one connection
is used several times, so we need a way how to understand that
page-server exited unexpectedly.
v2: don't write ps_info if a start descriptor isn't set
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The number of pipes are limited in a system, so it is better to know how
many we use.
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The SO_REUSEPORT option allows multiple sockets on the same
host to bind to the same port. This option has to ve restored when all
sockets are bound to a port. The same logic is already used to restore
SO_REUSEADDR.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Most of the pieces has already been described in the previous patches :)
so here's the summary.
* Dump:
When receiving a message, also receive any SCM-s (already there) and when
SCM_RIGHTs one is met -- go ahead and just dump received descriptors using
regular code, but taking current as the victim task.
Few words about file paths resolution -- since we do dump path-ed files
by receiving them from victim's parasite, such files sent via sockets
should still work OK, as we still receive them, just from another socket.
Several problems here:
1. Unix sockets sent via unix sockets form knots. Not supported.
2. Eventpolls sent via unix might themseves poll unix sockets. Knots
again. Not supported either.
* Restore:
On restore we need to make unix socket wait for the soon-to-be-scm-sent
descriptors to get restored, so we need to find them, then put a dependency.
After that, the fake fdinfo entry is attached to the respective file
descs, when sent the respective descriptors are closed.
https://github.com/xemul/criu/issues/251
v2: Addressed comments from Kirill
* Moved prepare_scms before adding fake fles (with comment)
* Add scm-only fles as fake, thus removing close_scm_fds
* Try hard finding any suitable fle to use as scm one when
queuing them for unix socket scm list, only allocate a new
one if really needed
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Same thing as for fifo-s.
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The plan is to have all file entries have unique ID. Fifo
generates a reg file entry to reuse the whole reg-files
c/r-ing engine (ghosts, open-by-path, etc.) and right now
ID for this entry is the same as for fifo entry.
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Nothing special here, just parse all known NLAs and keep them
on the image.
Issue #11
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Dump and restore process with RI control block. Runtime instrumentation
allows to collect information about program execution as CPU data.
The RI control block is dumped and restored for all threads.
Ptrace kernel interface is provided by:
c122bc239b13 ("s390/ptrace: add runtime instrumention register get/set")
Signed-off-by: Alice Frosi <alice@linux.vnet.ibm.com>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Dump and restore tasks with GS control blocks. Guarded-storage is a new
s390 feature to improve garbage collecting languages like Java.
There are two control blocks in the CPU:
- GS control block
- GS broadcast control block
Both control blocks have to be dumped and restored for all threads.
Signed-off-by: Alice Frosi <alice@linux.vnet.ibm.com>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
To use lazy-pages from runc '--lazy-pages' functionality needs to be
accessible via RPC. This enables lazy-pages via RPC.
The information on which port to listen is taken from the
criu_page_server_info protobuf structure. If the user has enabled
lazy-pages via RPC only criu_page_server_info.port is evaluated
to get the listen port.
With additional patches in runc is it possible to use lazy-restore
with 'runc checkpoint' and 'runc restore'.
travis-ci: success for lazy-pages: enable lazy-pages via RPC
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Note, that since zero pages stats never been into master we can make
incompatible changes to stats image.
travis-ci: success for revert zero pagemaps
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Extend the RPC feature check functionality to also test for lazy-pages
support. This does not check for certain UFFD features (yet). Right now
it only checks if kerndat_uffd() returns non-zero.
The RPC response is now transmitted from the forked process instead of
encoding all the results into the return code. The parent RPC process
now only sends an RPC message in the case of a failure.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This translates pagemap flags into strings for easier readability.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Having three booleans in pagemap entry clues for usage of good old flags.
Replace 'zero' and 'lazy' booleans with flags and use flags for internal
tracking of in_parent value. Eventually, in_parent may be deprecated.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
When appropriate, the lazy pages will no be written to the destination.
Instead, a pagemap entry for range of such pages will be marked with 'lazy'
flag.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The pages that are mapped to zero_page_pfn are not dumped but information
where are they located is required for lazy restore.
Note that get_pagemap users presumed that zero pages are not a part of the
pagemap and these pages were just silently skipped during memory restore.
At the moment I preserve this semantics and force get_pagemap to skip zero
pages.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
venet is not virtuozzo specific but rather
came from openvz, make it so.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There are two goals of this merge. First is to reduce the amount
of image files we generate and scan on restore. The latter is
more importaint, as even if we have no weird stuff like signalfd,
we still try to open this file. So after the merge we try to
open ~15 image files (out of ~30) less %) which is nice.
The 2nd goal is to simplify the C/R support for SCM messages.
This becomes possible with the fact, that all files we have can
be distinguished by their ID only, w/o type. This, in turn,
makes image layout for SCMs much simpler.
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the ghost file is too big, it might make sence to try seeking
for holes in it, thus reducing the image size.
We've seen this once for tmpfs files in issue #230.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>