Make sure we handle various corner cases:
* we received fewer pages than requested
* the request was capped because of unmap/remap etc
* the process has exited underneath us
Currently we are freeing the request once we've found the address to use
with uffd_copy(). Instead, let's keep the request object around, use it to
properly calculate number of pages we pass to uffd_copy() and then re-add
trailing range (if any) to the IOVs list.
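A minimal sketch of the bookkeeping described above, under stated assumptions: the struct, its fields, the helper name and the page size macro are hypothetical and are not the lazy-pages daemon's actual types. It only shows how a kept request can be used to compute the length handed to the copy step and the remaining range to re-add.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size, for illustration only */

/* Hypothetical request descriptor; field names are illustrative. */
struct lazy_req {
	unsigned long start; /* first address of the requested range */
	unsigned long len;   /* requested length in bytes */
};

/*
 * received_pages: number of pages that actually arrived
 * mapped_len:     bytes of the range that are still mapped (may be smaller
 *                 than req->len after an unmap/remap)
 *
 * Returns the number of bytes that may be copied. The request is updated
 * in place to describe the remaining range; req->len == 0 means there is
 * nothing to re-add.
 */
static unsigned long split_request(struct lazy_req *req,
				   unsigned long received_pages,
				   unsigned long mapped_len)
{
	unsigned long avail, copy_len;

	assert(req != NULL);

	/* the request may have been capped by an unmap/remap */
	avail = req->len < mapped_len ? req->len : mapped_len;

	/* we may have received fewer pages than requested */
	copy_len = received_pages * SKETCH_PAGE_SIZE;
	if (copy_len > avail)
		copy_len = avail;

	req->start += copy_len;
	req->len = avail - copy_len;
	return copy_len;
}
```

If the returned length is zero, the caller would skip the copy entirely, which also covers the case where the whole range has disappeared.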
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of pre-parsing command line twice, one time to detect -h/--help and
another time to find config file parameter, check for both in one pass.
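A rough sketch of what a single pre-parse pass could look like. The result struct and function name are hypothetical, and the `--config` spelling is assumed here for illustration; this is not the actual init_config code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Result of the pre-parse pass; a hypothetical structure. */
struct preparse_result {
	bool want_help;          /* -h or --help was seen */
	const char *config_file; /* argument of --config, or NULL */
};

/* Scan argv once, recording both the help request and the config file. */
static struct preparse_result preparse_args(int argc, char *const argv[])
{
	struct preparse_result res = { false, NULL };
	int i;

	assert(argc >= 0);

	for (i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "-h") || !strcmp(argv[i], "--help"))
			res.want_help = true;
		else if (!strcmp(argv[i], "--config") && i + 1 < argc)
			res.config_file = argv[++i]; /* value is the next word */
		else if (!strncmp(argv[i], "--config=", 9))
			res.config_file = argv[i] + 9; /* value follows '=' */
	}
	return res;
}
```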
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When config parsing was split into a separate part the handling of
-h/--help option during init_config was broken. Fix it.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
s/first_count/global_cfg_argc
s/second_count/user_cfg_argc
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently kerndat_init() runs before command line parsing, so running a
simple 'criu --version' command may produce something like:
Warn (criu/kerndat.c:847): Can't load /run/criu.kdat
Error (criu/util.c:842): exited, status=3
Error (criu/util.c:842): exited, status=3
Write 4294967295 to /proc/self/loginuid failed: Operation not permitted
Warn (criu/net.c:2732): Unable to get socket network namespace
Warn (criu/net.c:2732): Unable to get tun network namespace
Warn (criu/sk-unix.c:213): sk unix: Unable to open a socket file:
Operation not permitted
Error (criu/net.c:3023): Unable create a network namespace: Operation not
permitted
Warn (criu/net.c:3069): NSID isn't reported for network links
Version: 3.6
GitID: v3.6-611-g0b27d0a
Group early calls to kerndat_* and init_service_fd calls into a function
and call this function after the command line parsing is finished.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This function is an analogue to vsprintf(), and is used in very much the
same way. The caller expects the modified string pointer to be pointing to
a null-terminated string.
Signed-off-by: Joel Nider <joeln@il.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The string returned by std_vprint_num() is right-aligned in the buffer.
Therefore, we must print the string starting from the pointer returned in
the 'ps' argument, and not from the start of the original buffer.
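To illustrate why the returned pointer matters, here is a small stand-in for a right-aligned number formatter. The function name and signature are hypothetical and are not std_vprint_num() itself; the point is only that the digits end at the end of the buffer, so the start of the buffer holds no useful data.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Digits are written backwards from the end of buf and *ps is set to the
 * first digit. The string at *ps is null-terminated.
 * Returns the number of characters produced.
 */
static size_t print_num_right(char *buf, size_t blen, unsigned long num,
			      char **ps)
{
	char *end, *s;

	assert(buf != NULL && ps != NULL && blen >= 2);

	end = &buf[blen - 1];
	s = end;
	*s = '\0';
	do {
		assert(s > buf); /* buffer must be large enough */
		*--s = (char)('0' + num % 10);
		num /= 10;
	} while (num);

	*ps = s;
	return (size_t)(end - s);
}
```

A caller that printed from buf instead of from *ps would emit whatever happens to be in the unused leading part of the buffer.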
Signed-off-by: Joel Nider <joeln@il.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of merging unfinished requests with child's IOVs we queued them
into parent's IOV list. Fix it.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Commit 9cb20327aa4 ("return to epoll_wait after completing forks") was only
half way there. Adding the other half.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In file included from s390x_gs_threads.c:10:0:
../lib/lock.h: In function 'mutex_lock':
../lib/lock.h:148:4: error: implicit declaration of function 'pr_perror' [-Werror=implicit-function-declaration]
pr_perror("futex");
Reported-by: Mr Jenkins
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We shouldn't set MAKEFLAGS for the following reasons:
1. The user may want to specify some make parameter (e.g., `-d` for debug)
2. We lose the parallel build. No `-j` is passed to the submake, and it
looks like GNU make will not handle parallel recursive make if
$(MAKEFLAGS) is unset.
Easy to verify: add `sleep 3` to a build rule in Makefile.inc and
you'll find only one sleep process at a time. After the patch,
if you pass, say, `-j5` to make, you'll have 5 sleep processes.
Reverts: commit e9beed7bb3f3 ("build: zdtm -- Add implicit rules into
zdtm building").
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Let's drop the usage of COMPILE.c and OUTPUT_OPTION.
This allows running the submake with -R.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
$(MAKEFLAGS) already contains -r -R and --no-print-directory: those
flags are added in include.mk, which is included two lines above.
There is no comment and I see no point in erasing $(MAKEFLAGS)
rather than adding those flags, so I consider this a typo.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A set of images from criu dump can be used as a previous point, when we
are doing snapshots. In this case, each point contains a full set of
images.
https://github.com/checkpoint-restore/criu/issues/479
v2: return -1 if inventory_save_uptime failed
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Here is a common mistake:
	int funcX()
	{
		int ret;
		ret = funcA();
		if (ret < 0)
			goto err;
		if (smth)
			goto err; // return 0 !!!!
	err:
		return ret;
	}
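The snippet above returns 0 on the second jump because ret still holds the successful result of funcA(). A minimal sketch of one way to avoid this, with func_a() as a hypothetical stand-in for funcA() and smth passed in as a parameter:

```c
/* Hypothetical helper standing in for funcA() from the example. */
static int func_a(int fail)
{
	return fail ? -1 : 0;
}

/* Same shape as funcX(), but every error path sets ret before jumping. */
static int func_x(int fail_a, int smth)
{
	int ret;

	ret = func_a(fail_a);
	if (ret < 0)
		goto err;

	if (smth) {
		ret = -1; /* without this, the jump would return 0 */
		goto err;
	}
err:
	return ret;
}
```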
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There is no reason to keep this cleanup code in the generic
restore_one_task(), so let's move it for better readability.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The idea of the test is:
1) mmap a separate page and put a variable there, so that other usage does
not dirty this region. Initialize the variable with VALUE_A.
2) fork a child with the special pid == CHILD_NS_PID. Only if it is the
first child, overwrite the variable with VALUE_B.
3) wait for the end of the next predump or the end of restore with the
test_wait_pre_dump_ack/test_wait_pre_dump pair and kill our child.
Note: The memory region is "clean" in the parent.
4) goto (2) unless the end of c/r is reported by test_waitpre
So on the first iteration a child with pid CHILD_NS_PID was dumped with
VALUE_B; on all other iterations and on the final dump another child with
the same pid exists, but with VALUE_A. On all iterations after the first
one this memory region is "clean". So criu before the fix would
have restored VALUE_B, taking it from the first child's image, but it
should restore VALUE_A.
Note: The child in its turn waits for termination and checks that the
variable value doesn't change after c/r.
We should run the test with at least one predump to trigger the problem:
[root@snorch criu]# ./test/zdtm.py run --pre 1 -k always -t zdtm/transition/pid_reuse
Checking feature ns_pid
Checking feature ns_get_userns
Checking feature ns_get_parent
=== Run 1/1 ================ zdtm/transition/pid_reuse
===================== Run zdtm/transition/pid_reuse in ns ======================
DEP pid_reuse.d
CC pid_reuse.o
LINK pid_reuse
Start test
Test is SUID
./pid_reuse --pidfile=pid_reuse.pid --outfile=pid_reuse.out
Run criu pre-dump
Send the 10 signal to 52
Run criu dump
Run criu restore
Send the 15 signal to 73
Wait for zdtm/transition/pid_reuse(73) to die for 0.100000
Test output: ================================
14:47:57.717: 11235: ERR: pid_reuse.c:76: Wrong value in a variable after restore
14:47:57.717: 4: FAIL: pid_reuse.c:110: Task 11235 exited with wrong code 1 (errno = 11 (Resource temporarily unavailable))
<<< ================================
https://jira.sw.ru/browse/PSBM-67502
v3: simplify waitpid's status check
v9: switch to test_wait_pre_dump(_ack)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If the pre-dump-notify flag is set, zdtm sends a notification to the test
after pre-dump has finished and waits for the test to reply that it has
done all its work and is ready for the next pre-dump/dump.
How it can be used:
	while (!test_wait_pre_dump()) {
		/* Do something after predump */
		test_wait_pre_dump_ack();
	}
	/* Do something after restore */
Internally we open two pipes for the test: one for receiving the
notification (with both ends open) and one for replying to it (only the
write end open). The pipe fds are dupped to predefined numbers, and zdtm
opens these fds through /proc/<test-pid>/fd/{100,101} and communicates
with the test.
v9: switch to a two-way interface to remove a race where the operation we
try to run after predump may still be unfinished at the time of the next dump.
Suggested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We have a problem when a pid is reused between consecutive dumps: we can't
tell whether the pagemap and pages from the images of the parent dump are
still valid for restoring this pid. That can even lead to wrong memory
being restored for this pid, see the test in the last patch.
So this is an attempt to separate processes with a (likely) invalid
previous memory dump from processes with a 100% valid previous dump.
For that we use the value of /proc/<pid>/stat's start_time and also the
timestamp of each (pre)dump. If the start time is strictly less than the
timestamp, the pagemap for this pid from the previous dump
is valid: it was made for exactly the same process.
Creation time is in centiseconds by default so if predump is really fast
(<1csec) we can have false negative decisions for some processes, but in
case of long running processes we are fine.
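The comparison can be sketched as follows. The function names and the microsecond unit for the dump timestamp are assumptions made for this illustration, not the code added by the patch; ticks_per_sec is the value sysconf(_SC_CLK_TCK) would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_USEC_PER_SEC 1000000ULL

/* Convert a start time in clock ticks since boot to microseconds. */
static uint64_t ticks_to_usec(unsigned long long ticks,
			      unsigned long ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return ticks * SKETCH_USEC_PER_SEC / ticks_per_sec;
}

/*
 * Returns true when the previous dump can be trusted for this pid, i.e.
 * the process started strictly before the previous dump was taken.
 * Equal timestamps are treated as "unsure", hence not valid.
 */
static bool parent_dump_is_valid(unsigned long long start_ticks,
				 unsigned long ticks_per_sec,
				 uint64_t parent_dump_uptime_usec)
{
	return ticks_to_usec(start_ticks, ticks_per_sec) <
	       parent_dump_uptime_usec;
}
```

With 100 ticks per second, a process started at tick 1050 and a dump taken at 10.5 seconds of uptime fall into the same centisecond, which is the false negative case mentioned above.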
https://jira.sw.ru/browse/PSBM-67502
v2: remove __maybe_unused for get_parent_stats; fix get_parent_stats to
have static typing; print warning only if unsure; check has_dump_uptime
v3: read parent stats from image only once; reuse stat from previous
parse_pid_stat call on dump
v4: move code to function; use unsigned long long for ticks; put
proc_pid_stat on mem_dump_ctl; print warning on all pid-reuse cases
v5: free parent's stats entry properly, pass it in arguments to
(pre_)dump_one_task
v6: free parent's stats in error path too
v7: zero init parent_se
v8: improve error message
v9: switch to inventory image from stats, if pid-reuse fails - fail
current dump
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: actually we need only one value from inventory entry but I still
prefer general helper as we still need to read and allocate memory
for the whole structure
v2: fix get_parent_stats to have static typing
v3: simplify get_parent_stats to return a StatsEntry pointer instead of
doing it through arguments
v8: replace errors with warnings; we should watch for them only if we
have a corresponding error in detect_pid_reuse, otherwise they are fine
v9: change stats to inventory image
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We want to use a simple fact: if we have a live process in the pstree we
want to dump, and the start time of that process is less than the
pre-dump's timestamp (taken while all processes were frozen), then this
exact process existed (100% sure) at the time of that pre-dump and the
process' memory was dumped in the images.
So save the inventory image on pre-dump and put the uptime there.
https://jira.sw.ru/browse/PSBM-67502
v9: improve comment, put uptime into the inventory image because 1) there
are no stats in the parent images directory if the --work-dir option is set
to something different than the images directory, 2) stats-dump is not an
image and it is bad practice to put data required for restoring there.
v10:s/u_int64_t/uint64_t/
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
will be used in the next patch
https://jira.sw.ru/browse/PSBM-67502
note: the man page for /proc/uptime says that uptime is in seconds, and for
now the format is "seconds.centiseconds", where centiseconds is 2 digits
note: uptime is currently in csec, but I prefer saving it in usec, which
allows us to reuse this image field when/if we have a more accurate value.
v8: add length specifier to parse only centiseconds
v9: put uptime to u_int64_t directly, define CSEC_PER_SEC
v10: switch to uint64_t from u_int64_t, comment about usec in image
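A minimal sketch of the parsing described in the notes, assuming the "seconds.centiseconds" format. The function name and the USEC_PER_CSEC macro are made up for this example; only CSEC_PER_SEC is mentioned in the notes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CSEC_PER_SEC 100ULL
#define USEC_PER_CSEC 10000ULL

/*
 * Parse the first field of a /proc/uptime style string and store the
 * uptime in microseconds in *usec.
 * Returns 0 on success, -1 if the string is malformed.
 */
static int parse_uptime_usec(const char *str, uint64_t *usec)
{
	unsigned long sec, csec;

	assert(str != NULL && usec != NULL);

	/* %2lu reads at most two digits, i.e. only the centiseconds */
	if (sscanf(str, "%lu.%2lu", &sec, &csec) != 2)
		return -1;

	*usec = ((uint64_t)sec * CSEC_PER_SEC + csec) * USEC_PER_CSEC;
	return 0;
}
```

The value is kept in microseconds even though the source only has centisecond resolution, so a more precise source could later fill the same field.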
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This reverts commit cf2f035d9f5cdca96c814bd26e24c556ad736171.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Leave dump_uptime in stats file for backward and forward compatibility
though it is unused now.
This reverts commit fbba4d249a49e34e41c7c63ed77fab1bee3a13de.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This reverts commit 4a43486e24cf543ed2c0320552087f506c51635f.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
This reverts commit ffd415a5b5b8a9ef8fb99904a3c9b04ecdb3052b.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Service descriptors can be moved in a child process.
v2: handle errors of install_service_fd() properly
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Include deps files to recompile tests when dependency has changed.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
man 2 futex:
In the event of an error (and assuming that futex() was invoked via
syscall(2)), all operations return -1 and set errno to indicate the
cause of the error.
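A Linux-only sketch of what this means for a raw futex wrapper. The wrapper name is hypothetical; it converts the -1/errno convention of syscall(2) into a negative errno value, so callers do not compare the raw return value against -EAGAIN and similar.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Wait on *uaddr as long as it still contains 'expected'.
 * Returns 0 on wakeup, or a negative errno value on failure.
 */
static int futex_wait_errno(uint32_t *uaddr, uint32_t expected)
{
	long ret;

	assert(uaddr != NULL);

	ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
	if (ret == -1)
		return -errno; /* the error code lives in errno, not in ret */
	return 0;
}
```

For example, calling it with an expected value that does not match the current contents of the word fails immediately with -EAGAIN rather than blocking.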
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Though LOG_FD_OFF < IMG_FD_OFF, get_service_fd(LOG_FD_OFF) is greater than
get_service_fd(IMG_FD_OFF), see __get_service_fd, so the check here
should be reversed. Also add a BUG_ON to catch a possible __get_service_fd
change which could break this check again.
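The inversion can be illustrated with a simplified mapping. The enum values, the base argument and the function names below are assumptions made for this sketch; the real __get_service_fd is more involved.

```c
#include <assert.h>

/* Illustrative subset of service fd types; the values are assumptions. */
enum sfd_type { LOG_FD_OFF = 1, IMG_FD_OFF = 2, SERVICE_FD_MAX = 3 };

/*
 * Service fds are allocated downwards from a base, so a larger type
 * maps to a smaller fd number.
 */
static int get_service_fd_sketch(int base, int type)
{
	assert(type > 0 && type < SERVICE_FD_MAX);
	return base - type;
}

/* True if fd lies in the range occupied by service fds. */
static int is_service_fd_sketch(int base, int fd)
{
	/* the smallest type yields the highest fd and vice versa */
	int max = get_service_fd_sketch(base, 1);
	int min = get_service_fd_sketch(base, SERVICE_FD_MAX - 1);

	assert(min < max); /* guards against a change in the mapping */
	return fd >= min && fd <= max;
}
```

A range check written as if the fd order followed the type order would compare against the wrong bounds and never match.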
We have a problem when USERNSD_SK replaces LOG_FD_OFF: later, when
writing to the log, we actually send garbage commands to usernsd,
which fails to handle them and BUGs or crashes.
https://jira.sw.ru/browse/PSBM-83472
Also we had a similar problem when __userns_call received a bad response;
it likely has the same cause:
https://api.travis-ci.org/v3/job/352164661/log.txt
fixes commit 129bb14611c3 ("files: Prepare clone_service_fd() for
overlaping ranges.")
v2: move BUG_ON to main() to check it only once, use min+1 and max-1
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Unnamed temporary files are restored as ghost files.
If O_TMPFILE is set for the open() syscall, the pathname argument
specifies a directory, but criu gives a path to a ghost file.
(00.107450) 36: Error (criu/files-reg.c:1757): Can't open file tmp/#42274874 on restore: Not a directory
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
man 2 open:
...
O_TMPFILE (since Linux 3.11)
Create an unnamed temporary file. The pathname argument specifies a
directory; an unnamed inode will be created in that directory's
filesystem. Anything written to the resulting file will be lost when
the last file descriptor is closed, unless the file is given a name.
...
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It is possible that when the pages requested from the remote source arrive,
part of the memory range covered by the request is already gone because of
madvise(MADV_DONTNEED), mremap() etc.
Ensure we are not trying to uffd_copy more than we are allowed.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If we get a fork() event just before transferring the last IOV of the parent
process, continuing to background fetch after completing the fork event
handling will cause the lazy-pages daemon to exit, and nothing will monitor
the child process memory.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Since the memory mapping is now split between ->iovs and ->reqs lists, any
update to memory layout should take into account both lists.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of recalculating the size required for lazy_pages_info->buf when
copying IOVs at fork() time, keep the size of the buffer in the
lazy_pages_info struct.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When we return from epoll_run_rfds with a positive return value, it means
that the event handling loop was interrupted because the event should be
handled outside of that loop. This is always the case with UFFD_EVENT_FORK.
It may happen that the event occurred after we've completed the memory
transfer and we are on the way to successful return from the
handle_requests() function, but instead of returning 0 we will return the
positive value we've got from epoll_run_rfds.
Explicitly assigning return value of complete_forks() fixes this issue.
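The control flow can be sketched with stubs. All names below are hypothetical stand-ins and do not reproduce the real handle_requests(); the sketch only shows the positive "interrupted" value being replaced by the result of the fork handling.

```c
/* Stub event loop: a positive value means "interrupted by a fork event". */
static int epoll_run_stub(int pending_fork)
{
	return pending_fork ? 1 : 0;
}

/* Stub fork handling: 0 on success, negative on error. */
static int complete_forks_stub(void)
{
	return 0;
}

static int handle_requests_sketch(int pending_fork)
{
	int ret;

	ret = epoll_run_stub(pending_fork);
	if (ret < 0)
		return ret;

	if (ret > 0)
		/* replace the positive marker with the real outcome */
		ret = complete_forks_stub();

	return ret;
}
```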
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
With userfaultfd we cannot reliably service process_vm_readv calls. The
maps007 test that uses these calls passed previously by sheer luck.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>