mirror of https://github.com/checkpoint-restore/criu synced 2025-08-28 21:07:43 +00:00

4796 Commits

Author SHA1 Message Date
Pavel Emelyanov
35be2ee262 img: Don't return fd, return -1 instead
The same -- int-fd will soon go away, so return the
explicit int -1 instead of it.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:48:11 +04:00
Pavel Emelyanov
42821edccf img: Use errno when checking optional images open fail
There will be no int-fd soon, so one more preparation
to this fact.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:48:11 +04:00
Pavel Emelyanov
5f2a7ac27b img: Rename fdset -> imgset
Since we're going to switch from int-fd-s to class-image
soon the fdset name will not fit into the new terminology.

This patch is

 sed -e 's/fdset/imgset/g' -i *
 sed -e 's/imgset_fd/img_from_set/g' -i *
 git mv include/fdset.h include/imgset.h

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:48:10 +04:00
Pavel Emelyanov
1cb690ddc9 img: Move images IO helpers into .c file
This is to simplify the change from int fd to more
generic image class data-type.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:48:08 +04:00
Pavel Emelyanov
9d9ac53cd5 rst: Don't use write_img_buf for setting last_pid sysctl
The write_img_buf will be used only for images writing, while
in this place we just have a raw file descriptor.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:48:04 +04:00
Pavel Emelyanov
03482f69a2 img: Keep the copy of flags value in open_image_at
We drop the O_OPT from flags and will drop one more. So
instead of a set of bools let's keep a copy of the flags at
hand.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
2014-09-30 21:47:57 +04:00
Cyrill Gorcunov
78bbb0a161 files-reg: Simplify have_seen_dead_pid
We've a special helper xrealloc_safe for reallocs.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-30 17:50:53 +04:00
Cyrill Gorcunov
1ef5060769 cgroup: Use xmalloc in rewrite_cgsets
We prefer x* helpers because they print error
in case of allocation failures.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-30 17:50:34 +04:00
Cyrill Gorcunov
c01efda8af bfd: timerfd -- Fix parsing typo
While converting the reading of the data stream
to bfd, the @buf member was left untouched, so
incorrect data was read. Fix it by setting up the
proper one, i.e. @str itself; otherwise dumping
of timerfd files fails.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-30 11:48:15 +04:00
Pavel Emelyanov
5eb39aad4d bfd: Multiple buffers management (v2)
I plan to re-use the bfd engine for images buffering. Right
now this engine uses one buffer that gets reused by all
bfdopen()-s. This works for current usage (one-by-one proc
files access), but for images we'll need more buffers.

So this patch just puts buffers in a list and organizes a
stupid R-R with refill on it.

v2:
  Check for buffer allocation errors
  Print buffer mem pointer in debug

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-09-29 15:37:14 +04:00
Pavel Emelyanov
1a2e6cbd3f dump: Don't close pid-proc in vain
The open_pid_proc engine already knows how to cache
per-pid descriptors. No need to close it by hand.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:22:21 +04:00
Pavel Emelyanov
abeae2671b proc: Keep /proc/self cached separately from /proc/pid
When dumping tasks we do a lot of open_proc()-s and to
speed this up the /proc/pid directory is opened first
and the fd is kept cached. So next open_proc()-s do just
openat(cached_fd, name).

The thing is that we sometimes call open_proc(PROC_SELF)
in between and proc helpers cache the /proc/self too. As
the result we have a bunch of

  open(/proc/pid)
  close()
  open(/proc/self)
  close()

see-saw-s in the middle of dumping tasks.

To fix this we may cache the /proc/self separately from
the /proc/pid descriptor. This eliminates quite a lot
of pointless open-s and close-s.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:21:43 +04:00
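A minimal sketch of the caching idea from abeae2671b, assuming two independent cache slots (one for /proc/self, one for the most recent /proc/pid). The names sketch_open_proc_dir, sketch_open_proc and SKETCH_PROC_SELF are hypothetical and are not CRIU's actual helpers:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical marker meaning "the calling process" (not CRIU's PROC_SELF). */
#define SKETCH_PROC_SELF 0

static int self_dir_fd = -1;	/* cached fd for /proc/self */
static int pid_dir_fd = -1;	/* cached fd for /proc/<cached_pid> */
static pid_t cached_pid = -1;

/* Return a cached directory fd for /proc/self or /proc/<pid>, or -1. */
static int sketch_open_proc_dir(pid_t pid)
{
	char path[64];

	if (pid == SKETCH_PROC_SELF) {
		/* Own slot: asking for self never evicts the per-pid fd. */
		if (self_dir_fd < 0)
			self_dir_fd = open("/proc/self", O_RDONLY | O_DIRECTORY);
		return self_dir_fd;
	}

	if (pid_dir_fd >= 0 && cached_pid == pid)
		return pid_dir_fd;

	/* A different pid: drop the old per-pid fd and open the new one. */
	if (pid_dir_fd >= 0)
		close(pid_dir_fd);
	snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	pid_dir_fd = open(path, O_RDONLY | O_DIRECTORY);
	cached_pid = pid_dir_fd >= 0 ? pid : -1;
	return pid_dir_fd;
}

/* Open /proc/<pid>/<name> relative to the cached directory fd. */
static int sketch_open_proc(pid_t pid, const char *name)
{
	int dir = sketch_open_proc_dir(pid);

	return dir < 0 ? -1 : openat(dir, name, O_RDONLY);
}
```

With a single shared slot, alternating between a pid and self would reopen the directory on every switch, which is the see-saw described in the message; with two slots each directory is opened once.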
Pavel Emelyanov
829d433252 fd: Close caches proc-pid stuff before restoring files
We have a bug. If someone opens proc with open_pid_proc or the like
with PROC_SELF or a real PID before going to restore fds, then the
fd cached by proc helpers would end up in fd 0 (we close all
fds beforehand) and it may clash with restored fds.

We don't hit this right now simply due to being too lucky -- we
call open_proc(PROC_GEN) on "locks" which first closes the cached
per-pid descriptor and then reports back just the /proc one
which sits in service area.

But once we change this (next patch) things would get broken.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:21:31 +04:00
Pavel Emelyanov
1c8ab40e65 proc: Sanitate empty lines
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:21:23 +04:00
Pavel Emelyanov
e651a6eba4 filemap: Get vma mnt_id early
We have a, well, issue with how we calculate the vma's mnt_id.

Right now we get one via the criu-side file descriptor obtained by
opening the /proc/pid/map_files/ link. The problem is that these
descriptors are 'merged' or 'borrowed' by adjacent vmas from
previous ones. Thus, getting the mnt_id value for each of them
makes no sense -- these files are the same.

So move this mnt_id getting earlier into vma parsing code. This
brings a potential problem -- if we have two adjacent vmas
mapping the same inode (dev:ino pair) but living in different
mount namespaces -- this check would produce wrong result.
"Wrong" from the perspective that on restore correct file would
be opened from wrong namespace.

I propose to live with it, since this is not worse than the
--evasive-devices option, it's _very_ unlikely, but saves a lot
of openings.

Note, that in case app switched mount namespace and then mapped
some new library (with dlopen) things would work correctly -- new
vmas will likely be not adjacent and for different dev:ino.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:20:55 +04:00
Pavel Emelyanov
f84d19e09a vma: Add comments about some dump fields of vma_area
We have non-obvious handling of vm_file_fd/vm_socket_id
pair and the vma->file_borrowed.

Comment these in the structure.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:20:20 +04:00
Pavel Emelyanov
cf8c9ae870 vma: Reshuffle the struct vma_area
We have some fields, that are dump-only and some that
are restore only (quite a lot of them actually).

Reshuffle them on the vma_area to explicitly show which
one is which. And rename some of them for easier grep.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-29 13:19:55 +04:00
Andrey Vagin
92ee123386 mntns: don't dump criu's namespace
Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-24 17:30:23 +04:00
Andrey Vagin
606bc93a1a bfd: move the optimization in a proper place
Currently this optimization skips unscanned data
and doesn't work. Let's skip scanned data only.

Reported-by: Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-24 15:12:04 +04:00
Pavel Emelyanov
cfce460b48 proc_parse: Rework timers parser to use bfd
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:49:16 +04:00
Pavel Emelyanov
cc4a67b3ed proc_parse: Rework smaps parser to use bfd
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:49:07 +04:00
Pavel Emelyanov
2c8af6b8e6 proc_parse: Rework fdinfo parser to use bfd
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:48:58 +04:00
Pavel Emelyanov
53771adcaa bfd: File-descriptors based buffered read
This sounds strange, but we kinda need one. Here's the
justification for that.

We heavily open /proc/pid/foo files. To speed things up we
do pid_dir = open("/proc/pid") then openat(pid_dir, foo).
This really saves time on big trees, up to 10%.

Sometimes we need line-by-line scan of these files, and for
that we currently use the fdopen() call. It takes a file
descriptor (obtained with openat from above) and wraps one
into a FILE*.

The problem with the latter is that fdopen _always_ mmap()s
a buffer for reads and this buffer always (!) gets unmapped
back on fclose(). This pair of mmap() + munmap() eats time
on big trees, up to 10% in my experiments with p.haul tests.

The situation is made even worse by the fact that each fgets
on the file results in a new page allocated in the kernel
(since the mapping is new). And also this fgets copies data,
which is not big deal, but for e.g. smaps file this results
in ~8K bytes being just copied around.

Having said that, here's a small but fast way of reading a
descriptor line-by-line using big buffer for reducing the
amount of read()s.

After all per-task fopen_proc()-s get reworked on this engine
(next 4 patches) the results on p.haul test would be

        Syscall     Calls      Time (% of time)
Now:
           mmap:      463  0.012033 (3.2%)
         munmap:      447  0.014473 (3.9%)
Patched:
         munmap:       57  0.002106 (0.6%)
           mmap:       74  0.002286 (0.7%)

The amount of read()s and open()s doesn't change since FILE*
also uses page-sized buffer for reading.

Also this eliminates some amount of lseek()s and fstat()s
the fdopen() does every time to catch up with file position
and to determine what sort of buffering it should use (for
terminals it's \n-driven, for files it's not).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:48:38 +04:00
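The line-by-line reading described in 53771adcaa can be sketched as below. This is not CRIU's bfd implementation: it assumes a single fixed 4096-byte buffer embedded in the reader, and struct line_reader, lr_init and lr_readline are hypothetical names chosen for illustration:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define LR_BUF_SIZE 4096

struct line_reader {
	int fd;				/* descriptor to read from (not owned) */
	size_t start;			/* offset of the first unconsumed byte */
	size_t len;			/* number of unconsumed bytes */
	char buf[LR_BUF_SIZE + 1];	/* +1 leaves room for a final NUL */
};

static void lr_init(struct line_reader *lr, int fd)
{
	lr->fd = fd;
	lr->start = 0;
	lr->len = 0;
}

/*
 * Return the next line without its trailing '\n'. The pointer refers
 * into the reader's buffer and is valid until the next call. Returns
 * NULL on EOF, on a read error, or if a line exceeds the buffer.
 */
static char *lr_readline(struct line_reader *lr)
{
	for (;;) {
		char *line = lr->buf + lr->start;
		char *nl = memchr(line, '\n', lr->len);
		ssize_t n;

		if (nl) {
			size_t used = (size_t)(nl - line) + 1;

			*nl = '\0';
			lr->start += used;
			lr->len -= used;
			return line;
		}

		/* No complete line: move the tail to the front and refill. */
		if (lr->start) {
			memmove(lr->buf, line, lr->len);
			lr->start = 0;
		}
		if (lr->len == LR_BUF_SIZE)
			return NULL;	/* line longer than the buffer */

		do {
			n = read(lr->fd, lr->buf + lr->len,
				 LR_BUF_SIZE - lr->len);
		} while (n < 0 && errno == EINTR);

		if (n < 0)
			return NULL;
		if (n == 0) {
			/* EOF: hand out a last unterminated line, if any. */
			if (!lr->len)
				return NULL;
			lr->buf[lr->len] = '\0';
			lr->len = 0;
			return lr->buf;
		}
		lr->len += (size_t)n;
	}
}
```

Lines are returned in place rather than copied out, and opening or closing a reader needs no mmap()/munmap() pair, which are the two costs the message attributes to fdopen().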
Pavel
b30f0f0104 ns: Dump namespaces in parallel
The main reason for this is that dumping a namespace has a lot of
points where the process just waits for something. At the same
time the criu process waits for the ns dumper and doesn't dump
others.

A great example of waiting for something is the setns syscall.
Very often it calls synchronize_rcu(), which can take quite long.
Let other processes do something useful in the meantime.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-09-23 20:43:33 +04:00
Tycho Andersen
bbe3f941db remap: don't add remaps for a dead pid more than once
Unless we seek and re-read the PB images, the only way I can see to do this is
to keep a list of the previously seen dead pids and check if a new remap is in
that list.

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:39:28 +04:00
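The "list of previously seen dead pids" from bbe3f941db could look roughly like the following. It is a sketch that assumes a plain growable array with a linear scan; sketch_seen_dead_pid is a hypothetical name and the real helper in CRIU may differ:

```c
#include <stdlib.h>
#include <sys/types.h>

static pid_t *seen_pids;	/* growable array of pids already handled */
static size_t n_seen;

/*
 * Return 1 if @pid was recorded by an earlier call, 0 if it is new
 * (and has now been recorded), or -1 if the array could not grow.
 */
static int sketch_seen_dead_pid(pid_t pid)
{
	pid_t *tmp;
	size_t i;

	for (i = 0; i < n_seen; i++)
		if (seen_pids[i] == pid)
			return 1;

	/* Keep the old array intact if realloc() fails. */
	tmp = realloc(seen_pids, (n_seen + 1) * sizeof(*seen_pids));
	if (!tmp)
		return -1;

	seen_pids = tmp;
	seen_pids[n_seen++] = pid;
	return 0;
}
```

A caller would add the dead pid remap only when the function returns 0, so a pid referenced by several open files gets a single helper.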
Tycho Andersen
80c4e86e87 remap: don't try to remap other files in /proc
We can't remap these files correctly anyway, so we should just return success
if we find one of these files to remap.

v2: don't try to remap accessible files in /proc

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:38:38 +04:00
Pavel
867bcd2196 mnt: Shorten the mntns dumping loop
We currently have all mountinfo-s from all mnt namespaces collected
in one big list. On dump we scan through it to find the namespaces
we need to dump.

This can be optimized by walking the list of namespaces instead.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-09-23 20:37:32 +04:00
Andrey Vagin
6382ed43f5 x86: don't call wait4 as waitpid
Fix compilation on ARM:
pie/restorer.c: In function ‘wait_helpers’:
pie/restorer.c:728:3: error: implicit declaration of function ‘sys_waitpid’ [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-23 20:36:44 +04:00
Pavel Emelyanov
ab50f6ac18 ptrace: Factor out pie stopping code
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Andrey Vagin <avagin@parallels.com>
2014-09-23 20:36:10 +04:00
Andrey Vagin
48fcc7994d ptrace: flush breakpoints
Unfortunately the kernel doesn't flush hw breakpoints on
detaching ptrace. If a breakpoint is triggered without ptrace, the
process will be killed by SIGTRAP.

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 18:03:03 +04:00
Ruslan Kuprieiev
3b2ab35bc8 test: rpc: test page-server
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 16:30:48 +04:00
Ruslan Kuprieiev
a483cbda5c service: page-server: allow requesting page-server without setting any ps_info
Since we now can return port to user in autobind case, it's ok to request
page-server without setting ps_info.

Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 16:30:37 +04:00
Ruslan Kuprieiev
6b631faa4b service: page-server: return port back to user, v2
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 16:30:15 +04:00
Ruslan Kuprieiev
45fe2c9d6d page-server: assign opts.ps_port to sin_port in autobind case
Signed-off-by: Ruslan Kuprieiev <kupruser@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 12:55:12 +04:00
Andrew Vagin
13fc78b907 ptrace: say to parasite_stop_on_syscall where is we now
On restore parasite_stop_on_syscall() can be called after PTRACE_SYSCALL
and after a breakpoint. parasite_stop_on_syscall() must be called only
after PTRACE_SYSCALL; otherwise all tests with a single process get stuck.

Reported-by: Mr Jenkins
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 12:49:45 +04:00
Andrey Vagin
eda6b3d002 zdtm: don't call mount_cgroups a few times concurrently
Here is a race now:
./zdtm.sh --ct -d -C -x static/cgroup02 ns/static/pipe02 &> ns_static_pipe02.log || \
{ flock Makefile cat ns_static_pipe02.log; exit 1; }
./zdtm.sh --ct -d -C -x static/cgroup02 ns/static/busyloop00 &> ns_static_busyloop00.log || \
{ flock Makefile cat ns_static_busyloop00.log; exit 1; }
make[3]: `zdtm_ct' is up to date.
mkdir: cannot create directory ‘zdtm.GgIjUS/holder’: File exists

Reported-by: Mr Jenkins
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-22 12:15:45 +04:00
Andrey Vagin
248fc31531 restore: use breakpoints instead of tracing syscalls
Currently CRIU traces syscalls to catch the moment when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().

We can reduce the number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:57:18 +04:00
Andrey Vagin
0b1b81512b dump: use breakpoints instead of tracing syscalls (v2)
Currently CRIU traces syscalls to catch the moment when sigreturn() is
called. Now we trace recv(cmd), close(logfd), close(cmdfd), sigreturn().

We can reduce the number of steps by using hw breakpoints. A breakpoint is
set before sigreturn, so we will need to trace only it.

v2: In the first version the breakpoint was set after sigreturn. In this
case we have a problem with signals. If a process has pending signals,
it will start to process them after exiting from sigreturn(), but before
returning to userspace. So the breakpoint will not be triggered.

Finally, here are a few numbers on how we catch sigreturn.
Before this patch criu executes 36 syscalls and gets 12 signals.
With this patch criu executes 18 syscalls and gets 5 signals.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:56:25 +04:00
Andrey Vagin
e46a9f6bfc parasite: send PARASITE_CMD_FINI before resuming the target process
The control socket has enough buffer for one command and the target
process will not wait for a new command, so we will avoid extra context
switches.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:56:17 +04:00
Pavel Tikhomirov
8b9e18f07b zdtm: test for mlocked area restores if programm have no credentials
Test maps 17 pages and mlocks them, then changes user id from root
to 18943, after c/r checks that MAP_LOCKED bit is set for that vma.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:44:28 +04:00
Pavel Tikhomirov
2f85727410 zdtm: move get_smaps_bits to separate file for reuse
Signed-off-by: Pavel Tikhomirov <ptikhomirov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:44:26 +04:00
Tycho Andersen
e9d0499cd1 test: add a test for remap_dead_pid
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:42:49 +04:00
Tycho Andersen
f020bef776 remap: add a dead pid /proc remap
If a file like /proc/20/mountinfo is open, but 20 is a zombie (or doesn't exist
any more), we can't read this file at all, so a link remap won't work. Instead,
we add a new remap, called the dead process remap, which forks a TASK_HELPER as
that dead pid so that the restore task can open the new /proc/20/mountinfo
instead.

This commit also adds a new stage CR_STATE_RESTORE_SHARED. Since new
TASK_HELPERS are added when loading the shared resource images, we need to wait
to start forking tasks until after these resources are loaded.

v2: fix a mutex bug

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:42:48 +04:00
Tycho Andersen
c09ba04c48 restore: TASK_HELPERs live until RESTORE stage ends
In order to use TASK_HELPERS to open files from dead processes, they should
persist until criu is done restoring the filesystem, which happens in the
RESTORE stage. To do this, we need to pass each helper's PIDs to the restorer
blob, so that it can wait() on them when the restore stage is done.

This commit is in preparation for the remap_dead_pid commits.

v2: wait() on helpers after restore stage is over
v3: add CR_STATE_RESTORE_FS stage
v4: CR_STATE_RESTORE_FS waits for nr_tasks + nr_helpers, not nr_threads
v5: ditch CR_STATE_RESTORE_FS in favor of passing helpers to restorer blob

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:42:46 +04:00
Andrey Vagin
5a101d83af mount: skip the criu's mount namespace if tasks live in another mntns
Currently there is a bug: when we see criu's mount namespace,
we go to the "out" mark and don't validate mounts.

Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:40:26 +04:00
Pavel Emelyanov
1ebd56b024 proc: Don't use FILE * to reach children
The same reasoning as for personality file -- switch to
plain open + read + close.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:39:56 +04:00
Pavel Emelyanov
f3bee6d584 proc: Don't use FILE* for reading personality
It turned out, that fdopen (used in fopen_proc) always maps
a 4k buffer for reads and this buffer gets unmap-ed later
on fclose.

Taking into account the amount of proc files we read (~20
per task plus one file per opened file descriptor) this
mmap+munmap pair results in quite a lot of useless CPU time.

E.g. for a container of 20 tasks we have 1000 calls taking
~8% of total dump time.

So let's first stop doing this for simple cases -- one line
proc files.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:39:49 +04:00
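For one-line proc files the FILE*-free approach from f3bee6d584 amounts to a plain open + read + close into a small stack buffer. A minimal sketch, assuming the file holds a single hexadecimal value (as /proc/pid/personality does); sketch_read_hex_file is a hypothetical helper, not the function CRIU uses:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read a one-line file holding a hexadecimal number, without stdio.
 * @dirfd is a directory fd (e.g. a cached /proc/<pid> fd) or AT_FDCWD.
 * Returns 0 and stores the value in @val, or -1 on any failure.
 */
static int sketch_read_hex_file(int dirfd, const char *name,
				unsigned long *val)
{
	char buf[32];
	char *end;
	ssize_t n;
	int fd;

	fd = openat(dirfd, name, O_RDONLY);
	if (fd < 0)
		return -1;

	/* One read() into a stack buffer: no mmap()-ed stdio buffer. */
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';

	errno = 0;
	*val = strtoul(buf, &end, 16);
	if (end == buf || errno)
		return -1;
	return 0;
}
```

Each call costs three syscalls (openat, read, close) and no memory mapping, which is the saving the message is after.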
Cyrill Gorcunov
d36c4058bc plugin: Explicit assign plugin hooks
So it won't depend on the order in declaration.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-19 17:39:06 +04:00
Pavel Emelyanov
cc2f2ebba4 helpers: Create helpers with shared files and fs
They don't change these objects, so can share them
with parent (will be created slightly faster :) ).

The plan is to make them CLONE_VM, but it's not that
easy.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-18 20:27:21 +04:00
Pavel Emelyanov
cc4492e1c6 rst: Don't allocate page for child stack (v2)
When clone-ing kids we can set their stack on current, as
it will anyway be COW-ed later. One thing to note -- we do
need to reserve some space on the stack for glibc's arguments
and retcode allocation. 128 bytes should be enough for 16
pointers while clone has 5 arguments.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
2014-09-18 20:27:06 +04:00