CID 190174 (#1 of 1): Argument cannot be negative (NEGATIVE_RETURNS)
6. negative_returns: fd is passed to a parameter that cannot be negative.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
>>> CID 190177: Integer handling issues (NEGATIVE_RETURNS)
>>> rpc_sk is passed to a parameter that cannot be negative.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
CID 84654 (#1 of 1): Resource leak (RESOURCE_LEAK)
6. leaked_handle: Handle variable fd going out of scope leaks the handle.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The purpose of the num_children field in 'struct lazy_pages_info' was to
prevent closing the page-read while there are still active processes that
share it. It did work for the case when handling of the child processes
finished before the parent process. However, if the parent lpi is closed
first, we've got a dangling pointer at lpi->parent.
The obvious solution is to use simple reference counting.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is possible that notification about restore finish arrives at the same
time with a fork event. In such case we return to epoll_run_rfds without
resetting the poll_timeout and then we'll keep polling for events
indefinitely. To avoid this, we reset the poll_timeout to 0 as soon as we
know that restore is finished.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The only use for the userfault file descriptor after the process exited is
for debug logs. Using negative value instead of 0 makes logs more readable.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Although they are the same, xfree() looks more consistent with other code
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There are a lot of lines, which are longer than 79:
(04.331172)pie: 1: Error (criu/pie/restorer.c:460): seccomp: Unexpected tid ->
(04.331172)pie: 1: 1 != 1
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Credential commitment affects dumpable and pdeath signals
so we have to move their restore after the restore_creds,
just like we have in __export_restore_task (ie for
group leader).
https://jira.sw.ru/browse/PSBM-84198
Acked-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
kerndat_try_load_cache() fills kdat from /run/criu.kdat,
so it will contain some trash, if criu.kdat isn't compatible with the
current version of criu.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Looking up for pid in nesting pidns supposed to be done
for non group leaders only, thus __export_restore_thread
do this check on its own and we don't have to make
a similar lookup especially on group leader where
pids in args never were valid.
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Andrew proposed the test which actually triggered the issue
in current seccomp series, put it into a regular basis.
Suggested-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When considering if we to call PTRACE_O_SUSPEND_SECCOMP
on the tid we should take into account if there at least
one thread which has seccomp mode enabled, otherwise
we might miss filter suspension and restore procedure
might break due to own criu syscall get filtered out.
Same time we should move seccomp restore for threads
to take place after CR_STATE_RESTORE_SIGCHLD state
so that main criu code will attach to threads and
setup seccomp suspension flag before we start
restoring the filters.
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
To checkpoint per-thread seccomp filters we need
a significant rework of a dumping code. The general
idea is the following:
- Each thread is tracked by its tid inside global
seccomp rbtree thus we can easily add entries
there or lookup on demand.
- When we collect threads into pstree entries we fetch
its seccomp mode from procfs parsing routine and allocate
a new entry inside rbtree to remember the seccomp mode.
Note at this moment we're not dumping real filters yet
(because filter data image is a single one for all consumers)
- Once all tids are collected and our tree is complete we call for
seccomp_collect_dump_filters helper which walks every pstree entry
and iterate over each tid inside thread group calling
seccomp_dump_thread, which in turn uses ptrace engine to fetch
filters and keep this data in memory.
To optimize data usage we figure out if we can use TSYNC flag
on restore calling try_use_tsync helper: for TSYNC flag kernel
automatically propagate filter to all threads, thus we need to
compare all filters inside thread group for identity since there
is no other way to figure out if user passed TSYNC flag when
been creating filters.
- Finally dump_seccomp_filters is called which does real write
of seccomp filter data into an image file.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
At now we pretend that all threads are sharing seccomp chains
and at checkpoint moment we test seccomp modes to make sure
if this assumption is valid refusing to dump otherwise.
Still the kernel tacks seccomp filter chains per each thread
and now we've faced applications (such as java) where per-thread
chains are actively used. Thus we need to bring support of handling
filters via per-thread basis.
In this a bit intrusive patch the restore engine is lifted up
to treat each thread separately. Here what is done:
- Image core file is modified to keep seccomp filters
inside thread_core_entry. For backward compatibility
former seccomp_mode and seccomp_filter members in
task_core_entry are renamed to have old_ prefix and
on restore we test if we're dealing with old images.
Since per-thread dump is not yet implemeneted the
dumping procedure continue operating with old_ members.
- In pie restorer code memory containing filters are addressed
from inside thread_restore_args structure which now
contains seccomp mode itself and chain attributes
(number of filters and etc).
Reading of per-thread data is done in seccomp_prepare_threads
helper -- we take one pstree_item and walks over every thread
inside to allocate pie memory and pin data there.
Because of PIE specific, before jumping into pie code
we have to relocate this memory into new place and
for this seccomp_rst_reloc is served.
In restorer itself we check if thread_restore_args provides
us enabled seccomp mode (strict or filter passed) and call
for restore_seccomp_filter if needed.
- To unify names we start using seccomp_ prefix for all related
stuff involved into this change (prepare_seccomp_filters renamed
to seccomp_read_image because it only reads image and nothing
more, image handler is renamed to seccomp_img_entry instead
of too short 'se'.
With this change we're now allowed to start collecting and
dumping seccomp filters per each thread, which will be
done in next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Note that there is no real usage of this flag on restore,
we simply save it in image and will make a real use
later.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This header is main place for all seccomp related
structures so move seccomp_info here. This will
allow to minimize changes area when need to update
definitions and such.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will use it to figure out if filter log target is used.
Metadata associated with seccomp filter is relatively new
feature which allows userspace to get and set it back.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is one of our target platforms.
Cc: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On ppc64/aarch64 Linux can be set to use Large pages, so the PAGE_SIZE
isn't build-time constant anymore. Define it through _SC_PAGESIZE.
There are different sizes for a page on ppc64:
: #if defined(CONFIG_PPC_256K_PAGES)
: #define PAGE_SHIFT 18
: #elif defined(CONFIG_PPC_64K_PAGES)
: #define PAGE_SHIFT 16
: #elif defined(CONFIG_PPC_16K_PAGES)
: #define PAGE_SHIFT 14
: #else
: #define PAGE_SHIFT 12
: #endif
And on aarch64 there are default sizes and possibly someone can set his
own PAGE_SHIFT:
: config ARM64_PAGE_SHIFT
: int
: default 16 if ARM64_64K_PAGES
: default 14 if ARM64_16K_PAGES
: default 12
On the downside - each time we need PAGE_SIZE, we're doing libc
function call on aarch64/ppc64.
Fixes: #415
Tested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
PAGE_SIZE will be a variable value on platforms where it can be
different due to large pages.
And looks like (c) there is no reason for BUF_SIZE == PAGE_SIZE,
so let's keep it as it was, rather than complicating it with dynamic
allocation for the buffer.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The same value, but as PAGE_SIZE can be different for the same
platform - it's no more static value.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Personality value is printed in kernel like this:
static int proc_pid_personality(/* .. */)
{
int err = lock_trace(task);
if (!err) {
seq_printf(m, "%08x\n", task->personality);
unlock_trace(task);
}
return err;
}
So, we don't need a whole page to read the value.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
For architectures like aarch64/ppc64 it's needed to propagate the size
of page inside PIEs. For the parasite page size will be defined during
seizing, and for restorer during early initialization.
Afterward we can use PAGE_SIZE in PIEs like we did before.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The macro is used only in aio_estimate_nr_reqs():
unsigned int k_max_reqs = NR_IOEVENTS_IN_NPAGES(size/PAGE_SIZE);
Which compiler may evaluate as (((PAGE_SIZE*size)/PAGE_SIZE) - ...)
It works as long as PAGE_SIZE is long.
The patches set converts PAGE_SIZE to use sysconf() returning
(unsigned), non-long type and making the aio macro overflowing.
I do not see any value making PAGE_SIZE (unsigned long) typed.
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's actually number of bytes spliced, not pages.
And I bet (unsigned long) suits the purpose more than (int).
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's unused since commit fd3f33f5d23a ("headers: image.h -- Drop unused
entries"), so let's remove it completely.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
iptables creates /run/xtables.lock file and
we want to have it per-test.
(00.332159) 1: Running iptables-restore -w for iptables-restore -w
Fatal: can't open lock file /run/xtables.lock: Permission denied
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Return an error if we meet unexpected parameters in a config file
Cc: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Now we rely on scanf, that it will initializes a pointer to NULL, when
it fails to parse a string, but I can't find in a man page, that it has
to do this.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Starting with iptables 1.6.2, we have to use the --wait option,
but it doesn't work properly with userns, because in this case,
we don't have enough rights to open /run/xtables.lock.
(00.174703) 1: Running iptables-restore -w for iptables-restore -w Fatal: can't open lock file /run/xtables.lock: Permission denied
(00.192058) 1: Error (criu/util.c:842): exited, status=4
(00.192080) 1: Error (criu/net.c:1738): iptables-restore -w failed
(00.192088) 1: Error (criu/net.c:2389): Can't create net_ns
(00.192131) 1: Error (criu/util.c:1567): Can't wait or bad status: errno=0, status=65280
This patch workarounds this problem by mounting tmpfs into /run.
Net namespaces are restored in a separate process, so we can create a
new mount namespace and create new mounts.
https://github.com/checkpoint-restore/criu/issues/469
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
iptables 1.6.2 fails without /run, because it tries to create
the /run/xtables.lock file
Test output: ================================
Fatal: can't open lock file /run/xtables.lock: No such file or directory
23:29:06.098: 4: ERR: netns-nf.c:21: Can't set input rule (errno = 2 (No such file or directory))
23:29:06.098: 3: ERR: test.c:315: Test exited unexpectedly with code 255
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>