We will use it to figure out whether the filter log target is used.
Metadata associated with a seccomp filter is a relatively new
feature which allows userspace to get it and set it back.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
On ppc64/aarch64, Linux can be configured to use large pages, so PAGE_SIZE
isn't a build-time constant anymore. Define it through _SC_PAGESIZE.
There are different sizes for a page on ppc64:
: #if defined(CONFIG_PPC_256K_PAGES)
: #define PAGE_SHIFT 18
: #elif defined(CONFIG_PPC_64K_PAGES)
: #define PAGE_SHIFT 16
: #elif defined(CONFIG_PPC_16K_PAGES)
: #define PAGE_SHIFT 14
: #else
: #define PAGE_SHIFT 12
: #endif
And on aarch64 there are default sizes, and possibly someone can set their
own PAGE_SHIFT:
: config ARM64_PAGE_SHIFT
: int
: default 16 if ARM64_64K_PAGES
: default 14 if ARM64_16K_PAGES
: default 12
On the downside, each time we need PAGE_SIZE, we make a libc
function call on aarch64/ppc64.
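A minimal sketch of the run-time lookup, assuming a helper named runtime_page_size() (the name and the 4K fallback are illustrative only, not taken from the patch):

```c
#include <unistd.h>

/*
 * Hypothetical helper: ask libc for the page size at run time instead
 * of relying on a build-time PAGE_SIZE constant.
 */
static unsigned runtime_page_size(void)
{
	long ret = sysconf(_SC_PAGESIZE);

	/* sysconf() returns -1 on error; the 4K fallback is for this sketch only. */
	return ret > 0 ? (unsigned)ret : 4096u;
}
```

Each call goes through sysconf(), which is the cost mentioned above.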
Fixes: #415
Tested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
PAGE_SIZE will be a variable value on platforms where it can be
different due to large pages.
And it looks like (c) there is no reason for BUF_SIZE == PAGE_SIZE,
so let's keep it as it was, rather than complicating it with dynamic
allocation for the buffer.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The value is the same, but as PAGE_SIZE can differ on the same
platform, it is no longer a static value.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Personality value is printed in kernel like this:
static int proc_pid_personality(/* .. */)
{
int err = lock_trace(task);
if (!err) {
seq_printf(m, "%08x\n", task->personality);
unlock_trace(task);
}
return err;
}
So, we don't need a whole page to read the value.
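For illustration, the "%08x\n" output is 8 hex digits plus a newline, so a buffer of about ten bytes is enough. A sketch of parsing it, assuming a hypothetical helper parse_personality():

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse the text the kernel prints with "%08x\n".
 * Returns 0 on success and -1 if no hex number could be read.
 */
static int parse_personality(const char *buf, unsigned int *out)
{
	return sscanf(buf, "%x", out) == 1 ? 0 : -1;
}
```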
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
For architectures like aarch64/ppc64, the page size needs to be propagated
into PIEs. For the parasite, the page size will be defined during
seizing, and for the restorer, during early initialization.
Afterward we can use PAGE_SIZE in PIEs like we did before.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The macro is used only in aio_estimate_nr_reqs():
unsigned int k_max_reqs = NR_IOEVENTS_IN_NPAGES(size/PAGE_SIZE);
which the compiler may evaluate as (((PAGE_SIZE*size)/PAGE_SIZE) - ...)
It works as long as PAGE_SIZE is long.
The patch set converts PAGE_SIZE to use sysconf(), returning an
(unsigned), non-long type, which makes the aio macro overflow.
I do not see any value in making PAGE_SIZE (unsigned long) typed.
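To show the kind of overflow meant here, a sketch with two hypothetical helpers (they are not the aio macro itself) that multiply a page size by a page count in 32-bit and in 64-bit unsigned arithmetic:

```c
#include <stdint.h>

/* 32-bit product: wraps modulo 2^32 once the byte count gets large. */
static uint32_t bytes_in_pages_u32(uint32_t page_size, uint32_t nr_pages)
{
	return page_size * nr_pages;
}

/* 64-bit product: the same inputs fit without wrapping. */
static uint64_t bytes_in_pages_u64(uint64_t page_size, uint64_t nr_pages)
{
	return page_size * nr_pages;
}
```

With a 4096-byte page and 2^20 pages the 32-bit version wraps to 0, while the 64-bit one yields 2^32.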
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's actually number of bytes spliced, not pages.
And I bet (unsigned long) suits the purpose more than (int).
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It's unused since commit fd3f33f5d2 ("headers: image.h -- Drop unused
entries"), so let's remove it completely.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
iptables creates /run/xtables.lock file and
we want to have it per-test.
(00.332159) 1: Running iptables-restore -w for iptables-restore -w
Fatal: can't open lock file /run/xtables.lock: Permission denied
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Return an error if we meet unexpected parameters in a config file.
Cc: Veronika Kabatova <vkabatov@redhat.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently we rely on scanf initializing a pointer to NULL when
it fails to parse a string, but I can't find anything in the man page
saying that it has to do this.
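A sketch of the defensive pattern, assuming the pointer comes from an allocating conversion such as glibc's "%ms" (that assumption and the helper name first_token() are illustrative, not taken from the patch):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return the first whitespace-delimited token of
 * a line as a malloc()ed string, or NULL if there is none.
 */
static char *first_token(const char *line)
{
	char *tok = NULL;	/* set explicitly, do not rely on sscanf() */

	/* Trust only the return value, not the state of tok on failure. */
	if (sscanf(line, "%ms", &tok) != 1)
		return NULL;

	return tok;	/* the caller must free() it */
}
```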
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Starting with iptables 1.6.2, we have to use the --wait option,
but it doesn't work properly with userns, because in this case,
we don't have enough rights to open /run/xtables.lock.
(00.174703) 1: Running iptables-restore -w for iptables-restore -w
Fatal: can't open lock file /run/xtables.lock: Permission denied
(00.192058) 1: Error (criu/util.c:842): exited, status=4
(00.192080) 1: Error (criu/net.c:1738): iptables-restore -w failed
(00.192088) 1: Error (criu/net.c:2389): Can't create net_ns
(00.192131) 1: Error (criu/util.c:1567): Can't wait or bad status: errno=0, status=65280
This patch works around the problem by mounting tmpfs into /run.
Net namespaces are restored in a separate process, so we can create a
new mount namespace and create new mounts.
https://github.com/checkpoint-restore/criu/issues/469
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
iptables 1.6.2 fails without /run, because it tries to create
the /run/xtables.lock file
Test output: ================================
Fatal: can't open lock file /run/xtables.lock: No such file or directory
23:29:06.098: 4: ERR: netns-nf.c:21: Can't set input rule (errno = 2 (No such file or directory))
23:29:06.098: 3: ERR: test.c:315: Test exited unexpectedly with code 255
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Make sure we handle various corner cases:
* we received fewer pages than requested
* the request was capped because of unmap/remap etc
* the process has exited underneath us
Currently we are freeing the request once we've found the address to use
with uffd_copy(). Instead, let's keep the request object around, use it to
properly calculate the number of pages we pass to uffd_copy() and then re-add
the trailing range (if any) to the IOVs list.
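The bookkeeping described above can be sketched roughly as follows. The struct and function names are hypothetical and do not mirror the real lazy-pages data structures:

```c
/* Hypothetical description of a page range: start address and length in pages. */
struct page_range {
	unsigned long start;
	unsigned long nr_pages;
};

/*
 * Split a request into the part that can be copied now and the
 * trailing part that has to be queued again. Returns 1 when a
 * trailing range is left over, 0 otherwise.
 */
static int split_request(const struct page_range *req, unsigned long received,
			 unsigned long page_size,
			 struct page_range *copy, struct page_range *rest)
{
	/* Never copy more than was requested, even if more arrived. */
	unsigned long nr = received < req->nr_pages ? received : req->nr_pages;

	copy->start = req->start;
	copy->nr_pages = nr;

	rest->start = req->start + nr * page_size;
	rest->nr_pages = req->nr_pages - nr;

	return rest->nr_pages != 0;
}
```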
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of pre-parsing the command line twice, once to detect -h/--help and
once to find the config file parameter, check for both in one pass.
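A single-pass pre-scan could look roughly like this. The struct, the function name and the "--config" spelling are assumptions made for the sketch:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result of the early scan. */
struct early_opts {
	int help;		/* -h or --help was seen */
	const char *config;	/* argument of --config, or NULL */
};

/* Walk argv once and record both things the early stage cares about. */
static void early_scan(int argc, char **argv, struct early_opts *o)
{
	o->help = 0;
	o->config = NULL;

	for (int i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "-h") || !strcmp(argv[i], "--help"))
			o->help = 1;
		else if (!strcmp(argv[i], "--config") && i + 1 < argc)
			o->config = argv[++i];
	}
}
```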
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When config parsing was split into a separate part, the handling of
the -h/--help option during init_config was broken. Fix it.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently kerndat_init() runs before command line parsing, and running
a simple 'criu --version' command may produce something like:
Warn (criu/kerndat.c:847): Can't load /run/criu.kdat
Error (criu/util.c:842): exited, status=3
Error (criu/util.c:842): exited, status=3
Write 4294967295 to /proc/self/loginuid failed: Operation not permittedWarn
(criu/net.c:2732): Unable to get socket network namespace
Warn (criu/net.c:2732): Unable to get tun network namespace
Warn (criu/sk-unix.c:213): sk unix: Unable to open a socket file:
Operation not permitted
Error (criu/net.c:3023): Unable create a network namespace: Operation not
permitted
Warn (criu/net.c:3069): NSID isn't reported for network links
Version: 3.6
GitID: v3.6-611-g0b27d0a
Group early calls to kerndat_* and init_service_fd calls into a function
and call this function after the command line parsing is finished.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This function is an analogue to vsprintf(), and is used in very much the
same way. The caller expects the modified string pointer to be pointing to
a null-terminated string.
Signed-off-by: Joel Nider <joeln@il.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The string returned by std_vprint_num() is right-aligned in the buffer.
Therefore, we must print the string starting from the pointer returned in
the 'ps' argument, and not from the start of the original buffer.
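To illustrate why the returned pointer matters, here is a sketch of a right-aligning formatter. It is a hypothetical stand-in, not the real std_vprint_num():

```c
#include <stddef.h>

/*
 * Hypothetical formatter: digits are written backwards from the end of
 * buf, so the result is right-aligned. *ps is set to the first digit.
 * blen must be at least 2.
 */
static void format_num_right(char *buf, size_t blen, unsigned long num,
			     char **ps)
{
	char *s = &buf[blen - 1];

	*s = '\0';
	do {
		*--s = '0' + (char)(num % 10);
		num /= 10;
	} while (num && s > buf);

	*ps = s;	/* the bytes before s are left untouched */
}
```

Printing from buf instead of from *ps would output whatever happened to be in the untouched leading bytes.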
Signed-off-by: Joel Nider <joeln@il.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of merging unfinished requests with child's IOVs we queued them
into parent's IOV list. Fix it.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Commit 9cb20327aa ("return to epoll_wait after completing forks") was only
half way there. Adding the other half.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In file included from s390x_gs_threads.c:10:0:
../lib/lock.h: In function 'mutex_lock':
../lib/lock.h:148:4: error: implicit declaration of function 'pr_perror' [-Werror=implicit-function-declaration]
pr_perror("futex");
Reported-by: Mr Jenkins
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We shouldn't set MAKEFLAGS for the following reasons:
1. User may want to specify some make parameter (e.g., `-d` for debug)
2. We lose parallel build. No `-j` is passed to submake and it looks
like GNU make will not deal with parallel recursive make if
$(MAKEFLAGS) is unset back.
Easy to verify: Add `sleep 3` to build rule in Makefile.inc and
you'll find only one sleep process at a time. After the patch
if you specify say `-j5` to make - you'll have 5 sleep processes.
Reverts: commit e9beed7bb3 ("build: zdtm -- Add implicit rules into
zdtm building").
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Let's drop usage of COMPILE.c, OUTPUT_OPTION.
This will allow running submake with -R.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
$(MAKEFLAGS) already contains -r -R and --no-print-directory: those
flags are added in include.mk, which is included two lines above.
There is no comment and I see no sense in erasing $(MAKEFLAGS)
rather than adding those flags. So I considered this a typo.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Here is a common mistake:
int funcX()
{
int ret;
ret = funcA();
if (ret < 0)
goto err;
if (smth)
goto err; // return 0 !!!!
err:
return ret;
}
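One possible way to avoid it, as a sketch: set an error code before every jump to the error label. func_a() and the smth parameter are placeholders for illustration only.

```c
/* Placeholder for the first step; always succeeds in this sketch. */
static int func_a(void)
{
	return 0;
}

static int func_x(int smth)
{
	int ret;

	ret = func_a();
	if (ret < 0)
		goto err;

	if (smth) {
		ret = -1;	/* ret is still 0 here, so set it before jumping */
		goto err;
	}
err:
	return ret;
}
```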
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There is no reason to keep this cleanup code in the generic
restore_one_task(), so let's move it for better readability.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The idea of the test is:
1) mmap a separate page and put a variable there, so that other usage does
not dirty this region. Initialize the variable with VALUE_A.
2) fork a child with special pid == CHILD_NS_PID. Only if it is the first
child, overwrite the variable with VALUE_B.
3) wait for the end of the next predump or end of restore with the
test_wait_pre_dump_ack/test_wait_pre_dump pair and kill our child.
Note: The memory region is "clean" in parent.
4) goto (2) unless end of cr is reported by test_waitpre
So on the first iteration the child with pid CHILD_NS_PID was dumped with
VALUE_B; on all other iterations and on the final dump another child with the
same pid exists, but with VALUE_A. But on all iterations after the first
one this memory region is "clean". So criu before the fix would
have restored VALUE_B, taking it from the first child's image, but should
restore VALUE_A.
Note: The child in its turn waits for termination and checks that the
variable's value doesn't change after c/r.
We should run the test with at least one predump to trigger the problem:
[root@snorch criu]# ./test/zdtm.py run --pre 1 -k always -t zdtm/transition/pid_reuse
Checking feature ns_pid
Checking feature ns_get_userns
Checking feature ns_get_parent
=== Run 1/1 ================ zdtm/transition/pid_reuse
===================== Run zdtm/transition/pid_reuse in ns ======================
DEP pid_reuse.d
CC pid_reuse.o
LINK pid_reuse
Start test
Test is SUID
./pid_reuse --pidfile=pid_reuse.pid --outfile=pid_reuse.out
Run criu pre-dump
Send the 10 signal to 52
Run criu dump
Run criu restore
Send the 15 signal to 73
Wait for zdtm/transition/pid_reuse(73) to die for 0.100000
Test output: ================================
14:47:57.717: 11235: ERR: pid_reuse.c:76: Wrong value in a variable after restore
14:47:57.717: 4: FAIL: pid_reuse.c:110: Task 11235 exited with wrong code 1 (errno = 11 (Resource temporarily unavailable))
<<< ================================
https://jira.sw.ru/browse/PSBM-67502
v3: simplify waitpid's status check
v9: switch to test_wait_pre_dump(_ack)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the pre-dump-notify flag is set, zdtm sends a notification to the test after
pre-dump has finished and waits for the test to reply that
it has done all its work and is now ready for the next pre-dump/dump.
How it can be used:
while (!test_wait_pre_dump()) {
/* Do something after predump */
test_wait_pre_dump_ack();
}
/* Do something after restore */
Internally we open two pipes for the test: one for receiving the notification
(with both ends open) and one for replying to it (only the write end open). Fds
of the pipes are dupped to predefined numbers and zdtm opens these fds through
/proc/<test-pid>/fd/{100,101} and communicates with the test.
v9: switch to a two-way interface to remove the race when the operation we try
to run after predump may still be unfinished at the time of the next dump.
Suggested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We have a problem: when a pid is reused between consecutive dumps, we can't
tell whether the pagemap and pages from the parent dump's images are still
valid for restoring this pid. That can even lead to wrong memory being
restored for this pid, see the test in the last patch.
So this is an attempt to separate processes with a (likely) invalid previous
memory dump from processes with a 100% valid previous dump.
For that we use the value of /proc/<pid>/stat's start_time and also the
timestamp of each (pre)dump. If the start time is strictly less than the
timestamp, that means that the pagemap for this pid from the previous dump
is valid - it was made for exactly the same process.
Creation time is in centiseconds by default, so if predump is really fast
(<1csec) we can have false negative decisions for some processes, but in
the case of long-running processes we are fine.
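The comparison itself can be sketched as below. The function name and parameters are hypothetical: the start time is taken in clock ticks, as /proc/<pid>/stat reports it, and the previous dump's uptime is assumed to be stored in usec.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: the parent pagemap is trusted only if the
 * process started strictly before the previous (pre)dump.
 *
 * start_ticks    - start_time from /proc/<pid>/stat, in clock ticks
 * ticks_per_sec  - tick rate, e.g. from sysconf(_SC_CLK_TCK)
 * dump_uptime_us - uptime saved at the previous (pre)dump, in usec
 */
static bool parent_pagemap_is_valid(uint64_t start_ticks,
				    uint64_t ticks_per_sec,
				    uint64_t dump_uptime_us)
{
	/* Rounds down, so a borderline process is treated as reused. */
	uint64_t dump_ticks = dump_uptime_us * ticks_per_sec / 1000000ULL;

	return start_ticks < dump_ticks;
}
```

A process that started within the same tick as the dump is reported as possibly reused, which matches the false negatives described above.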
https://jira.sw.ru/browse/PSBM-67502
v2: remove __maybe_unused for get_parent_stats; fix get_parent_stats to
have static typing; print warning only if unsure; check has_dump_uptime
v3: read parent stats from image only once; reuse stat from previous
parse_pid_stat call on dump
v4: move code to function; use unsigned long long for ticks; put
proc_pid_stat on mem_dump_ctl; print warning on all pid-reuse cases
v5: free parent's stats entry properly, pass it in arguments to
(pre_)dump_one_task
v6: free parent's stats in error path too
v7: zero init parent_se
v8: improve error message
v9: switch to inventory image from stats, if pid-reuse fails - fail
current dump
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: actually we need only one value from inventory entry but I still
prefer general helper as we still need to read and allocate memory
for the whole structure
v2: fix get_parent_stats to have static typing
v3: simplify get_parent_stats to return a StatsEntry pointer instead of
doing it through arguments
v8: replace errors with warnings; we should watch them only if we
have a corresponding error in detect_pid_reuse, else they are fine
v9: change stats to inventory image
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We want to use a simple fact: if we have a live process in a pstree we
want to dump, and the start time of that process is less than the pre-dump's
timestamp (taken while all processes were frozen), then this exact
process existed (100% sure) at the time of that pre-dump and the
process' memory was dumped in images.
So save the inventory image on pre-dump and put an uptime there.
https://jira.sw.ru/browse/PSBM-67502
v9: improve comment, put uptime into the inventory image as 1) there are no
stats in the parent images directory if the --work-dir option is set to
something different than the images directory, 2) stats-dump is not an image
and it is bad practice to put data required for restoring there.
v10: s/u_int64_t/uint64_t/
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: man for /proc/uptime says that uptime is in seconds and for now
the format is "seconds.centiseconds", where centiseconds is 2 digits
note: now uptime is in csec but I prefer saving it in usec, which allows
us to reuse this image field when/if we have a more accurate value.
v8: add length specifier to parse only centiseconds
v9: put uptime to u_int64_t directly, define CSEC_PER_SEC
v10: switch to uint64_t from u_int64_t, comment about usec in image
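A sketch of the parsing described in the notes, assuming a hypothetical helper parse_uptime_us(). USEC_PER_CSEC is a name made up for the sketch:

```c
#include <stdint.h>
#include <stdio.h>

#define CSEC_PER_SEC	100
#define USEC_PER_CSEC	10000ULL	/* name invented for this sketch */

/*
 * Hypothetical helper: parse the first field of /proc/uptime
 * ("seconds.centiseconds") and convert it to usec.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_uptime_us(const char *buf, uint64_t *us)
{
	unsigned long sec, csec;

	/* The width of 2 limits the fraction to the two centisecond digits. */
	if (sscanf(buf, "%lu.%2lu", &sec, &csec) != 2)
		return -1;

	*us = ((uint64_t)sec * CSEC_PER_SEC + csec) * USEC_PER_CSEC;
	return 0;
}
```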
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Leave dump_uptime in stats file for backward and forward compatibility
though it is unused now.
This reverts commit fbba4d249a.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>