Tracking cpuid features is easier when they are kept in sync with the
kernel source code. Note, though, that while feature bits are not part of
the ABI in the kernel, we save them into an image, so make sure they are
placed at the proper positions and keep backward compatibility in mind.
Here we also start using v2 of the cpuinfo image with more
feature bits.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Send pre-dump notify to 36
Traceback (most recent call last):
File "zdtm.py", line 2161, in <module>
do_run_test(tinfo[0], tinfo[1], tinfo[2], tinfo[3])
File "zdtm.py", line 1549, in do_run_test
cr(cr_api, t, opts)
File "zdtm.py", line 1264, in cr
test.pre_dump_notify()
File "zdtm.py", line 490, in pre_dump_notify
fdin.write(struct.pack("i", 0))
TypeError: write() argument 1 must be unicode, not str
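This kind of TypeError is raised when packed binary data is written to a file object opened in text mode. A minimal sketch of the distinction, assuming a temporary file stands in for the notify fd (the helper names send_notify/read_notify and the path are illustrative, not taken from zdtm.py):

```python
import os
import struct
import tempfile


def send_notify(path, value=0):
    """Write one packed C int to `path`, opened in binary mode."""
    # struct.pack() returns a byte string, so the file must be opened
    # with "wb"; a text-mode file object rejects it with TypeError.
    with open(path, "wb") as fdin:
        fdin.write(struct.pack("i", value))


def read_notify(path):
    """Read back one packed C int from `path`."""
    with open(path, "rb") as fdout:
        return struct.unpack("i", fdout.read(struct.calcsize("i")))[0]


# Hypothetical stand-in for the notify fd: a plain temporary file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
send_notify(demo_path, 0)
notify_value = read_notify(demo_path)

# On Python 3, writing bytes to a text-mode file fails in the same way.
try:
    with open(demo_path, "w") as text_file:
        text_file.write(struct.pack("i", 0))
    text_mode_failed = False
except TypeError:
    text_mode_failed = True

os.remove(demo_path)
```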
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The idea of the test is:
1) mmap a separate page and put a variable there, so that other usage does
not dirty this region. Initialize the variable with VALUE_A.
2) fork a child with the special pid == CHILD_NS_PID. Only if it is the first
child, overwrite the variable with VALUE_B.
3) wait for the end of the next predump or the end of restore with the
test_wait_pre_dump_ack/test_wait_pre_dump pair and kill our child.
Note: The memory region is "clean" in the parent.
4) goto (2) unless end of cr is reported by test_waitpre
So on the first iteration the child with pid CHILD_NS_PID was dumped with
VALUE_B; on all other iterations and on the final dump another child with the
same pid exists, but with VALUE_A. On all iterations after the first
one this memory region is "clean". So criu before the fix would
have restored VALUE_B, taking it from the first child's image, but it should
restore VALUE_A.
Note: The child in its turn waits for termination and checks that the
variable's value does not change after c/r.
We should run the test with at least one predump to trigger the problem:
[root@snorch criu]# ./test/zdtm.py run --pre 1 -k always -t zdtm/transition/pid_reuse
Checking feature ns_pid
Checking feature ns_get_userns
Checking feature ns_get_parent
=== Run 1/1 ================ zdtm/transition/pid_reuse
===================== Run zdtm/transition/pid_reuse in ns ======================
DEP pid_reuse.d
CC pid_reuse.o
LINK pid_reuse
Start test
Test is SUID
./pid_reuse --pidfile=pid_reuse.pid --outfile=pid_reuse.out
Run criu pre-dump
Send the 10 signal to 52
Run criu dump
Run criu restore
Send the 15 signal to 73
Wait for zdtm/transition/pid_reuse(73) to die for 0.100000
Test output: ================================
14:47:57.717: 11235: ERR: pid_reuse.c:76: Wrong value in a variable after restore
14:47:57.717: 4: FAIL: pid_reuse.c:110: Task 11235 exited with wrong code 1 (errno = 11 (Resource temporarily unavailable))
<<< ================================
https://jira.sw.ru/browse/PSBM-67502
v3: simplify waitpid's status check
v9: switch to test_wait_pre_dump(_ack)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Commit 37e4c7bfc264 fixed arm, ppc, x86 (32bit),
but introduced a wrong definition for x86_64. Fix that.
Also, add comments to the raw fork() implementation.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The criu_status_in is not always used and it may be -1 when the signal
handler closes it. With lazy-pages we hit a corner case which clobbers the
errno value. This happens when we resume the process inside a glibc syscall
wrapper and get the signal before the page containing errno is copied. In
this case, the signal handler is invoked before the syscall return value is
written to errno, and the value of errno seen by the process becomes
EBADF because of close(-1) in the signal handler.
Let's ensure that close() in signal handler does not fail to make Jenkins
happier while the proper solution for the lazy-pages issue is found.
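The effect of close(-1) and the guard described above can be sketched as follows. This is only an illustration in Python; close_quietly is a hypothetical name, and the actual change lives in C test code that is not shown here:

```python
import errno
import os


def close_quietly(fd):
    """Close fd only when it is valid; return True if close() was called."""
    if fd < 0:
        # Nothing to close: skipping the syscall avoids an EBADF failure.
        return False
    os.close(fd)
    return True


# close(-1) fails with EBADF; in C this is the value that lands in errno.
try:
    os.close(-1)
    observed = None
except OSError as exc:
    observed = exc.errno

# A valid descriptor is closed, an unused (-1) one is skipped.
pipe_r, pipe_w = os.pipe()
closed_valid = close_quietly(pipe_r)
closed_invalid = close_quietly(-1)
os.close(pipe_w)
```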
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The kerndat_init() is now called before the jump to the action handler. This
allows us to use kdat directly without calling the corresponding
kerndat_*() helpers.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
On fedora rawhide seccomp_metadata is for some
reason not defined (while in the kernel it was introduced
together with PTRACE_SECCOMP_GET_METADATA). So
let's use a trick for a while -- define our own alias.
Once the system headers settle down we might find
a more suitable solution. Because it's a part of the kernel
API we're on the safe side.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Once the CR_STATE_RESTORE_SIGCHLD stage has been triggered we are
not allowed to exit, so raise a BUG instead.
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Looking up a pid in a nested pidns is supposed to be done
for non group leaders only, so __export_restore_thread
does this check on its own, and we don't have to make
a similar lookup, especially for the group leader, where
the pids in args were never valid.
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Andrew proposed a test which actually triggered the issue
in the current seccomp series; add it to the regular test suite.
Suggested-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When deciding whether to set PTRACE_O_SUSPEND_SECCOMP
on a tid we should take into account whether there is at least
one thread which has seccomp mode enabled; otherwise
we might miss filter suspension, and the restore procedure
might break because criu's own syscalls get filtered out.
At the same time we should move seccomp restore for threads
to take place after the CR_STATE_RESTORE_SIGCHLD state,
so that the main criu code attaches to the threads and
sets up the seccomp suspension flag before we start
restoring the filters.
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
To checkpoint per-thread seccomp filters we need
a significant rework of the dumping code. The general
idea is the following:
- Each thread is tracked by its tid inside a global
seccomp rbtree, so we can easily add entries
there or look them up on demand.
- When we collect threads into pstree entries we fetch
each thread's seccomp mode from the procfs parsing routine and allocate
a new entry inside the rbtree to remember the seccomp mode.
Note that at this moment we're not dumping real filters yet
(because the filter data image is a single one for all consumers).
- Once all tids are collected and our tree is complete we call the
seccomp_collect_dump_filters helper, which walks every pstree entry
and iterates over each tid inside the thread group calling
seccomp_dump_thread, which in turn uses the ptrace engine to fetch
filters and keep this data in memory.
To optimize data usage we figure out whether we can use the TSYNC flag
on restore by calling the try_use_tsync helper: with the TSYNC flag the kernel
automatically propagates a filter to all threads, so we need to
compare all filters inside the thread group for identity, since there
is no other way to figure out whether the user passed the TSYNC flag when
creating the filters.
- Finally dump_seccomp_filters is called, which does the real write
of seccomp filter data into an image file.
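The identity comparison behind the TSYNC decision can be sketched as follows. This is an illustration only, not the real try_use_tsync: the function name can_use_tsync and the dictionary of raw filter programs per tid are assumptions made for the example.

```python
def can_use_tsync(filters_by_tid):
    """Return True when every thread in the group has an identical filter chain.

    `filters_by_tid` maps a tid to a list of raw filter programs (bytes),
    a hypothetical in-memory representation used only for this sketch.
    """
    chains = list(filters_by_tid.values())
    if not chains:
        # No threads collected, so there is nothing to synchronize.
        return False
    first = chains[0]
    # Chains match only if they have the same length and the same programs.
    return all(chain == first for chain in chains[1:])


# Two threads sharing one filter: TSYNC would reproduce this on restore.
same = {100: [b"\x06\x00"], 101: [b"\x06\x00"]}
# One thread has an extra filter: each thread must be restored on its own.
mixed = {100: [b"\x06\x00"], 101: [b"\x06\x00", b"\x15\x00"]}
```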
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
At present we pretend that all threads share seccomp chains,
and at checkpoint time we test seccomp modes to make sure
this assumption is valid, refusing to dump otherwise.
Still, the kernel tracks seccomp filter chains per thread,
and we have now run into applications (such as java) where per-thread
chains are actively used. Thus we need to add support for handling
filters on a per-thread basis.
In this somewhat intrusive patch the restore engine is reworked
to treat each thread separately. Here is what is done:
- The core image file is modified to keep seccomp filters
inside thread_core_entry. For backward compatibility the
former seccomp_mode and seccomp_filter members in
task_core_entry are renamed to have an old_ prefix, and
on restore we test whether we're dealing with old images.
Since per-thread dump is not yet implemented, the
dumping procedure continues operating with the old_ members.
- In the pie restorer code the memory containing filters is addressed
from inside the thread_restore_args structure, which now
contains the seccomp mode itself and chain attributes
(number of filters, etc).
Reading of per-thread data is done in the seccomp_prepare_threads
helper -- we take one pstree_item and walk over every thread
inside to allocate pie memory and pin the data there.
Because of PIE specifics, before jumping into pie code
we have to relocate this memory to a new place, and
seccomp_rst_reloc serves this purpose.
In the restorer itself we check whether thread_restore_args provides
us with an enabled seccomp mode (strict or filter) and call
restore_seccomp_filter if needed.
- To unify names we start using the seccomp_ prefix for everything
involved in this change (prepare_seccomp_filters is renamed
to seccomp_read_image because it only reads the image and nothing
more; the image handler is renamed to seccomp_img_entry instead
of the too short 'se').
With this change we can start collecting and
dumping seccomp filters per thread, which will be
done in the next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Note that there is no real usage of this flag on restore yet;
we simply save it in the image and will make real use of it
later.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This header is the main place for all seccomp related
structures, so move seccomp_info here. This will
minimize the area of changes when we need to update
definitions and such.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We will use it to figure out whether the filter log target is used.
Metadata associated with a seccomp filter is a relatively new
feature which allows userspace to get it and set it back.
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the pre-dump-notify flag is set, zdtm sends a notify to the test after
pre-dump has finished and waits for the test to send back a reply that
the test did all its work and is now ready for the next pre-dump/dump.
How it can be used:
while (!test_wait_pre_dump()) {
/* Do something after predump */
test_wait_pre_dump_ack();
}
/* Do something after restore */
Internally we open two pipes for the test: one for receiving the notify (with
two open ends) and one for replying to it (only the write end open). The fds of
the pipes are dupped to predefined numbers, and zdtm opens these fds through
/proc/<test-pid>/fd/{100,101} and communicates with the test.
v9: switch to a two-way interface to remove the race where the operation we try
to run after predump may still be unfinished at the time of the next dump.
Suggested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We have a problem: when a pid is reused between consecutive dumps we can't
tell whether the pagemap and pages from the parent dump's images are still
valid for restoring this pid. That can even lead to wrong memory being
restored for this pid, see the test in the last patch.
So this is an attempt to separate processes with a (likely) invalid previous
memory dump from processes with a 100% valid previous dump.
For that we use the value of /proc/<pid>/stat's start_time and also the
timestamp of each (pre)dump. If the start time is strictly less than the
timestamp, the pagemap for this pid from the previous dump
is valid: it was made for exactly the same process.
Creation time is in centiseconds by default, so if a predump is really fast
(<1csec) we can get false negative decisions for some processes, but for
long running processes we are fine.
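The comparison described above can be sketched as follows. This is a simplified illustration, not the detect_pid_reuse implementation: the function name, the parameter names and the default of 100 ticks per second are assumptions for the example.

```python
USEC_PER_SEC = 1000000


def pid_reused(start_time_ticks, parent_dump_uptime_usec, ticks_per_sec=100):
    """Return True if the parent dump cannot be trusted for this pid.

    start_time_ticks: start time from /proc/<pid>/stat, in clock ticks.
    parent_dump_uptime_usec: uptime saved at the previous (pre)dump, in usec.
    ticks_per_sec: assumed tick rate (100 gives centisecond resolution).
    """
    start_time_usec = start_time_ticks * USEC_PER_SEC // ticks_per_sec
    # Only a strictly earlier start time proves the process already
    # existed when the previous (pre)dump was taken.
    return start_time_usec >= parent_dump_uptime_usec


# Process started at 5.00s, previous dump at 6.00s: same process, images valid.
old_process = pid_reused(500, 6 * USEC_PER_SEC)
# Process started at 7.00s, after the previous dump: the pid was reused.
new_process = pid_reused(700, 6 * USEC_PER_SEC)
# Same centisecond as the dump: cannot be sure, treat as reused.
same_tick = pid_reused(600, 6 * USEC_PER_SEC)
```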
https://jira.sw.ru/browse/PSBM-67502
v2: remove __maybe_unused for get_parent_stats; fix get_parent_stats to
have static typing; print warning only if unsure; check has_dump_uptime
v3: read parent stats from image only once; reuse stat from previous
parse_pid_stat call on dump
v4: move code to function; use unsigned long long for ticks; put
proc_pid_stat on mem_dump_ctl; print warning on all pid-reuse cases
v5: free parent's stats entry properly, pass it in arguments to
(pre_)dump_one_task
v6: free parent's stats in error path too
v7: zero init parent_se
v8: improve error message
v9: switch to inventory image from stats, if pid-reuse fails - fail
current dump
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: actually we need only one value from inventory entry but I still
prefer general helper as we still need to read and allocate memory
for the whole structure
v2: fix get_parent_stats to have static typing
v3: simplify get_parent_stats to return a StatsEntry pointer instead of
doing it through arguments
v8: replace errors with warnings; we should watch for them only if we
have a corresponding error in detect_pid_reuse, otherwise they are fine
v9: change stats to inventory image
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We want to use a simple fact: if we have an alive process in a pstree we
want to dump, and the start time of that process is less than the pre-dump's
timestamp (taken while all processes were frozen), then this exact
process existed (100% sure) at the time of this pre-dump and the
process' memory was dumped in the images.
So save an inventory image on pre-dump and put an uptime there.
https://jira.sw.ru/browse/PSBM-67502
v9: improve comment, put uptime into the inventory image, as 1) there are no
stats in the parent images directory if the --work-dir option is set to
something different from the images directory, 2) stats-dump is not an image
and it is bad practice to put data required for restoring there.
v10: s/u_int64_t/uint64_t/
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: man for /proc/uptime says that uptime is in seconds, and for now
the format is "seconds.centiseconds", where centiseconds is 2 digits
note: uptime is currently in csec, but I prefer saving it in usec, which allows
us to reuse this image field when/if we have a more accurate value.
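Parsing the "seconds.centiseconds" format into microseconds can be sketched as follows. This is an illustration in Python rather than the C helper from the patch; the function name parse_uptime_usec is made up for the example, and only the first two fractional digits are used, in the spirit of the v8 note below.

```python
USEC_PER_SEC = 1000000
CSEC_PER_SEC = 100


def parse_uptime_usec(text):
    """Convert the first field of /proc/uptime content to microseconds."""
    first_field = text.split()[0]
    sec_str, _, frac_str = first_field.partition(".")
    # Take at most two fractional digits, i.e. centiseconds.
    csec = int(frac_str[:2].ljust(2, "0")) if frac_str else 0
    return int(sec_str) * USEC_PER_SEC + csec * (USEC_PER_SEC // CSEC_PER_SEC)


# Sample content in the documented "uptime idle" layout.
sample = "350735.47 234388.90\n"
sample_usec = parse_uptime_usec(sample)
```

Storing the value in usec keeps the image field usable if a more precise source appears later; only the conversion step would change.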
v8: add length specifier to parse only centiseconds
v9: put uptime to u_int64_t directly, define CSEC_PER_SEC
v10: switch to uint64_t from u_int64_t, comment about usec in image
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This makes it possible to have the pageserver communication go over anonymous
unix sockets, e.g. created by socketpair().
Such a setup makes it easier to secure the pageserver connection by wrapping
it in an encrypted tunnel. It also helps prevent attacks where
a malicious process connects to page server and injects its own
stream of pages to either fool criu into restoring wrong pages or
to DoS the pageserver by having it exhaust local storage by writing
large .img files.
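What an anonymous unix socket pair looks like can be sketched as follows. This only shows the property the message relies on (a connected pair with no address to connect to); how criu is handed its end of the pair is not covered here, and the payload is made up for the example.

```python
import socket

# socketpair() returns two already connected, unnamed AF_UNIX sockets.
# There is no path or port, so an unrelated process has nothing to connect to.
client_sk, server_sk = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Hypothetical payload standing in for the page stream.
client_sk.sendall(b"page-data")
received = server_sk.recv(64)

client_sk.close()
server_sk.close()
```

One end can be kept by the process that sets up the tunnel while the other is inherited by the page server, so the data path never appears as a listening socket.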
Signed-off-by: Pawel Stradomski <pstradomski@google.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the master peer of a veth device lies inside the
node's root net-ns we should not request a device index
but rather let the kernel number it automatically.
When there is a separate net-ns for the master peer it should
be safe to request an index though.
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>