travis-ci: success for uffd: A new set of improvements
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Currently we allocate a single page to use as an intermediate buffer for
holding data that will be used in UFFDIO_COPY. Let's allocate a buffer per
process and make that buffer large enough to hold the largest contiguous
chunk.
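A minimal sketch of the sizing idea, not the actual patch: it assumes the
process memory is already described by an array of IOVs, and the helper names
largest_chunk_len() and alloc_copy_buf() are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical helper: length in bytes of the largest chunk in an IOV array. */
static size_t largest_chunk_len(const struct iovec *iovs, size_t nr)
{
	size_t i, max = 0;

	for (i = 0; i < nr; i++)
		if (iovs[i].iov_len > max)
			max = iovs[i].iov_len;
	return max;
}

/*
 * Hypothetical helper: allocate one page-aligned buffer per process, sized
 * for the largest chunk. Returns NULL on failure or when there are no IOVs.
 */
static void *alloc_copy_buf(const struct iovec *iovs, size_t nr, size_t page_size)
{
	size_t len = largest_chunk_len(iovs, nr);
	void *buf = NULL;

	/* The alignment must be a non-zero power of two. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	if (!len)
		return NULL;
	if (posix_memalign(&buf, page_size, len))
		return NULL;
	return buf;
}
```

The buffer is allocated once per process and can be reused for every
UFFDIO_COPY, since no chunk is larger than it.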
travis-ci: success for uffd: A new set of improvements
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
page_read->seek_page was restored to skip zero pagemaps, therefore we
should check its return value rather than underlying PME.
travis-ci: success for uffd: A new set of improvements
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Inline relevant parts of get_page inside uffd_handle_page and call
uffd_{copy,zero}_page after we've got the data.
travis-ci: success for uffd: A new set of improvements
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
We will want to poll not only a bunch of uffd-s, but also the lazy
socket, so here's "an fd and a callback" object to be pushed into
epoll.
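Roughly, such an object could look like the sketch below. The struct and
function names are illustrative assumptions, not necessarily the ones used
in the patch; only the epoll calls are the real API.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical "an fd and a callback" object. */
struct epoll_rfd {
	int fd;
	int (*revent)(struct epoll_rfd *rfd);
};

/* Register the object with epoll; the object itself is the event payload. */
static int epoll_add_rfd(int epollfd, struct epoll_rfd *rfd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	assert(rfd && rfd->revent);
	ev.data.ptr = rfd;
	return epoll_ctl(epollfd, EPOLL_CTL_ADD, rfd->fd, &ev);
}

/*
 * Wait for at most one event and dispatch it to its callback.
 * Returns the callback's value, 0 on timeout, or -1 on epoll_wait() failure.
 */
static int epoll_run_once(int epollfd, int timeout_ms)
{
	struct epoll_event ev;
	struct epoll_rfd *rfd;
	int n = epoll_wait(epollfd, &ev, 1, timeout_ms);

	if (n <= 0)
		return n;
	rfd = ev.data.ptr;
	return rfd->revent(rfd);
}

/* Example callback: consume one byte and report that it was handled. */
static int read_one_byte(struct epoll_rfd *rfd)
{
	char c;

	return read(rfd->fd, &c, 1) == 1 ? 1 : -1;
}
```

With this shape a uffd and the lazy socket only differ in the callback they
carry, so the poll loop does not need to know which kind of fd fired.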
travis-ci: success for uffd: A new set of improvements
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Instead of tracking memory handled by userfaultfd on a per-page basis we can
use IOVs for contiguous chunks.
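As an illustration only, a sorted list of page addresses can be coalesced
into IOVs like this. The function name pages_to_iovs() and its interface are
assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: coalesce page addresses (sorted, ascending) into IOVs,
 * one IOV per contiguous run of pages.
 * Returns the number of IOVs written, or -1 if 'out' is too small.
 */
static int pages_to_iovs(const uintptr_t *pages, size_t nr_pages,
			 size_t page_size, struct iovec *out, size_t max_out)
{
	size_t i, n = 0;

	assert(page_size);

	for (i = 0; i < nr_pages; i++) {
		if (n) {
			struct iovec *last = &out[n - 1];

			/* The page directly follows the last IOV: extend it. */
			if ((uintptr_t)last->iov_base + last->iov_len == pages[i]) {
				last->iov_len += page_size;
				continue;
			}
		}
		if (n == max_out)
			return -1;
		out[n].iov_base = (void *)pages[i];
		out[n].iov_len = page_size;
		n++;
	}
	return (int)n;
}
```

Six pages forming three runs end up as three entries instead of six, which
is the saving the per-chunk tracking is after.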
travis-ci: success for uffd: A new set of improvements
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Right now zdtm.py hacks around the core code and waits for
a second for the socket to appear. Let's instead add a proper
--daemon mode for the lazy-pages daemon, with pidfile generation.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Instead of creating mm-related parts of restore info in process tree we
can directly use MmEntry for VMA traversals.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Move the find_vmas and collect_uffd_pages functions before they are
actually used. This allows dropping the forward declaration of find_vmas and
will make subsequent refactoring cleaner.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The event received should be checked to be #PF before
accessing its other arguments.
[ Mike:
Well, looking forward to seeing non-cooperative userfaultfd patches in the
kernel, we should have something like
static int handle_uffd_event(struct lazy_pages_info *lpi)
{
	read(&msg...);
	switch (msg.event) {
	case UFFD_EVENT_PAGEFAULT:
		handle_pagefault(lpi, msg);
		break;
	default:
		return -1;
	}
	return 0;
}
But since this patch is a bugfix anyway: <ack>
]
travis-ci: success for uffd: A set of improvements over criu/uffd.c
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
After previous patch we no longer need this hash since
we don't need fd -> lpi conversion.
travis-ci: success for uffd: A set of improvements over criu/uffd.c
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
This helps us get lpi MUCH faster on #PF.
travis-ci: success for uffd: A set of improvements over criu/uffd.c
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
This avoids excessive memcpy() one instruction below.
travis-ci: success for uffd: A set of improvements over criu/uffd.c
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
In cases where errno is set, we need to use pr_perror() to print it.
In cases where errno is not set, we should use pr_err().
pr_perror() doesn't need a colon or a newline. pr_err() needs a newline.
Cc: Adrian Reber <areber@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
travis-ci: success for Assorted nitpicks
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Use relative path for UNIX socket instead of absolute one.
This ensures we won't run into problems with invalid socket names.
travis-ci: success for lazy-pages: use relative path for UNIX socket
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Restore of a zombie process does not call setup_uffd, which causes the
lazy-pages daemon to get stuck forever waiting for a (pid, uffd) pair to arrive.
Let's extend the protocol between restore and lazy-pages so that for a zombie
process a (0, -1) pair will be sent instead of the actual (uffd, pid).
travis-ci: success for lazy-pages: misc fixes (rev4)
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
travis-ci: success for lazy-pages: misc fixes (rev4)
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
To properly handle zombie processes we will need to distinguish failures
coming from socket communications from an absent userfault file descriptor.
travis-ci: success for lazy-pages: misc fixes (rev4)
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
When a VMA is mapped with MAP_LOCKED its address space is populated with
pages, which causes UFFDIO_COPY to fail with EEXIST. Until we can find some
better solution let's avoid marking VMAs with MAP_LOCKED as lazy.
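The check amounts to something like the sketch below. struct vma_desc and
vma_can_be_lazy() are hypothetical simplifications, and the "private
anonymous only" rule is an assumption of the sketch, not taken from the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical, simplified VMA descriptor. */
struct vma_desc {
	unsigned long start;
	unsigned long end;
	int flags;	/* mmap() flags, e.g. MAP_PRIVATE | MAP_ANONYMOUS */
};

static bool vma_can_be_lazy(const struct vma_desc *vma)
{
	assert(vma && vma->start < vma->end);

	/* Locked mappings are populated up front, so UFFDIO_COPY hits EEXIST. */
	if (vma->flags & MAP_LOCKED)
		return false;

	/* Assumption: only private anonymous memory is a lazy candidate. */
	return (vma->flags & MAP_PRIVATE) && (vma->flags & MAP_ANONYMOUS);
}
```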
Fixes: #238
travis-ci: success for lazy-pages: misc fixes (rev3)
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Only the UFFD daemon is aware if pages are in the parent or not. The
restore will continue to work as any lazy-restore except that pages from
parent checkpoints will be pre-populated by the restorer.
The restorer will still register the whole memory region as being
handled by userfaultfd even if it contains pages from parent
checkpoints. Userfaultfd page faults will only happen on pages which
contain no data. This means pages pre-populated from the parent will not
trigger a userfaultfd message even if marked as being handled by
userfaultfd.
The UFFD daemon knows about pages which are available in the parent
checkpoints and will not push those pages unnecessarily to userfaultfd.
The following steps to migrate a process are now possible:
Source system:
* criu pre-dump -D /tmp/cp/1 -t <PID>
* rsync -a /tmp/cp <destination>:/tmp
* criu dump -D /tmp/cp/2 -t <PID> --port 27 --lazy-pages \
--prev-images-dir ../1/ --track-mem
Destination system:
* rsync -a <source>:/tmp/cp /tmp/
* criu lazy-pages --page-server --address <source> --port 27 \
-D /tmp/cp/2 &
* criu restore --lazy-pages -D /tmp/cp/2
This will now restore all pages from the parent checkpoint if they
are not marked as lazy in the second checkpoint.
v2:
- changed parent detection to use pagemap_in_parent()
v3:
- unfortunately this reverts
c11cf95afbe023a2816a3afaecb65cc4fee670d7
"criu: mem: skip lazy pages during restore based on pagemap info"
To be able to split the VMA-s in the right chunks for the restorer
it is necessary to make the decision lazy or not on the VmaEntry
level.
v4:
- everything has changed thanks to Mike Rapoport's suggestion
- the VMA-s are no longer touched or split
- instead of over 100 lines of changes this is now a two-line patch
Signed-off-by: Adrian Reber <areber@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Combining pre-copy (pre-dump) and post-copy (lazy-pages) modes exposed a
problem in the function page_pipe_split_ppb(). The function is used to
split the page-pipe-buffer so that it only contains the IOVs requested
by the restore side during lazy restore.
Unfortunately it only splits the leading IOVs out of the
page-pipe-buffer and not the trailing:
Before split for requested address 0x7f27284d1000:
page-pipe: ppb->iov 0x7f0f74d93040
page-pipe: 0x7f27282bb000 1
page-pipe: 0x7f27284d1000 1
page-pipe: 0x7f27284dd000 2
After split:
page-pipe: ppb->iov 0x7f0f74d93050
page-pipe: 0x7f27284d1000 1
page-pipe: 0x7f27284dd000 2
and:
page-pipe: ppb->iov 0x7f0f74d93040
page-pipe: 0x7f27282bb000 1
This patch keeps on splitting the page-pipe-buffer until it contains
only the requested address with the requested length.
After split (still trying to load 0x7f27284d1000):
page-pipe: ppb->iov 0x7f0f74d93050
page-pipe: 0x7f27284d1000 1
and:
page-pipe: ppb->iov 0x7f0f74d93040
page-pipe: 0x7f27282bb000 1
and:
page-pipe: ppb->iov 0x7f0f74d93060
page-pipe: 0x7f27284dd000 2
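To illustrate the intended result, here is a simplified sketch that finds
where the leading and trailing splits have to happen. It is not the code of
page_pipe_split_ppb(); struct ppb_split and split_for_request() are
hypothetical, and the sketch assumes the request starts and ends on IOV
boundaries, as in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical result: half-open index ranges into the IOV array. */
struct ppb_split {
	size_t lead_end;	/* IOVs [0, lead_end) go to a leading buffer */
	size_t req_end;		/* IOVs [lead_end, req_end) hold the request; */
				/* IOVs [req_end, nr) go to a trailing buffer */
};

/* Returns 0 and fills 's' if [addr, addr + len) maps to whole IOVs, else -1. */
static int split_for_request(const struct iovec *iov, size_t nr,
			     uintptr_t addr, size_t len, struct ppb_split *s)
{
	uintptr_t end = addr + len;
	size_t i, j;

	assert(iov && s && len);

	/* Everything before the IOV starting at 'addr' is the leading part. */
	for (i = 0; i < nr; i++)
		if ((uintptr_t)iov[i].iov_base == addr)
			break;
	if (i == nr)
		return -1;

	for (j = i; j < nr; j++) {
		uintptr_t iov_end = (uintptr_t)iov[j].iov_base + iov[j].iov_len;

		if (iov_end == end) {
			s->lead_end = i;
			s->req_end = j + 1;
			return 0;
		}
		/* The request must end on an IOV boundary and stay contiguous. */
		if (iov_end > end || j + 1 == nr ||
		    (uintptr_t)iov[j + 1].iov_base != iov_end)
			return -1;
	}
	return -1;
}
```

For the three IOVs listed above and a one-page request at 0x7f27284d1000
(assuming 4096-byte pages) this yields lead_end = 1 and req_end = 2, i.e.
one leading IOV, the requested IOV, and one trailing IOV.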
v2:
- moved declarations to the declaration block
Signed-off-by: Adrian Reber <areber@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Currently potentially lazy pages are not counted as written even if they
are dumped into pages*img. Count these pages as "pages_written" when dump is
not going to skip writing lazy pages to disk.
travis-ci: success for criu: mem: count all pages actually written to image as "pages_written"
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
This translates pagemap flags into strings for easier readability.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Do the same here: the flags are now enough to tell a hole
from a pagemap.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
They are now the same and PE_PRESENT bit helps us distinguish
holes from pagemaps having pages inside.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
The ->write_hole and ->write_pagemap now look very much
alike, so let's merge them. This is a preparatory patch
that makes the hole type decision based on PE_FOO flags.
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Instead of checking whether the VMA containing a page can be lazy for each
page, skip the entire parts of pagemap that have PE_LAZY flag set.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The PE_PRESENT flag is always set for pagemap entries that have
corresponding pages in the pages*img. Pagemap entries describing a hole
either with zero page or with pages in the parent snapshot will not have
the PE_PRESENT flag set.
A pagemap entry that may be lazily restored is a special case. For the lazy
restore from disk case, both PE_LAZY and PE_PRESENT will be set in the
pagemap, but for the remote lazy pages case only PE_LAZY will be set.
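The flag combinations can be summarized with a small sketch. The flag names
come from this message, but the bit values, enum page_source and
pagemap_source() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values are assumed for this sketch only. */
#define PE_LAZY		(1u << 1)
#define PE_PRESENT	(1u << 2)

/* Hypothetical classification of where the pages of an entry come from. */
enum page_source {
	SRC_HOLE,		/* zero page or pages in the parent snapshot */
	SRC_IMAGE,		/* pages are in pages*img */
	SRC_IMAGE_LAZY,		/* lazy restore from disk */
	SRC_REMOTE_LAZY,	/* remote lazy pages, nothing in pages*img */
};

static enum page_source pagemap_source(uint32_t flags)
{
	bool lazy = flags & PE_LAZY;
	bool present = flags & PE_PRESENT;

	if (lazy)
		return present ? SRC_IMAGE_LAZY : SRC_REMOTE_LAZY;
	return present ? SRC_IMAGE : SRC_HOLE;
}
```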
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
With 'zero' and 'lazy' booleans replaced by the flags field in
PagemapEntry, it is required that page-xfer will be aware of the change.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Having three booleans in a pagemap entry calls for good old flags.
Replace 'zero' and 'lazy' booleans with flags and use flags for internal
tracking of in_parent value. Eventually, in_parent may be deprecated.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Create the socket early so that it will be available after restoring the
namespaces.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Very minimalistic at the moment, no remote pages and no namespaces.
Still better than nothing :)
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Adrian Reber <areber@redhat.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
We'll use it in anon shmem dedup so we need to have access
to it in shmem.c
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The UNIX sockets do not like relative paths. Assuming both lazy-pages
daemon and restore use the same opts.work_dir, their working directory full
path will be the same.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The remote lazy pages variant can be run as follows:
src# criu dump -t <pid> --lazy-pages --port 9876 -D /tmp/1 &
src# while ! sudo fuser 9876/tcp ; do sleep 1; done
src# scp -r /tmp/1/ dst:/tmp/
dst# criu lazy-pages --page-server --address src --port 9876 -D /tmp/1 &
dst# criu restore --lazy-pages -D /tmp/1
In a nutshell, this implementation of remote lazy pages does the following:
- dump collects the process memory into the pipes, transfers non-lazy pages
to the images or to the page-server on the restore side. The lazy pages
are kept in pipes for later transfer
- when the dump creates the page_pipe_bufs, it marks the buffers containing
potentially lazy pages with PPB_LAZY
- at the dump_finish stage, the dump side starts a TCP server that will
handle page requests from the restore side
- the checkpoint directory is transferred to the restore side
- on the restore side the lazy-pages daemon is started; it creates a UNIX
socket to receive uffd's from the restore and a TCP socket to forward page
requests to the dump side
- restore creates memory mappings and fills the VMAs that cannot be handled
by uffd with the contents of the pages*img.
- restore registers lazy VMAs with uffd and sends the userfault file
descriptors to the lazy-pages daemon
- when a #PF occurs, the lazy-pages daemon sends a PS_IOV_GET command to the
dump side; the command contains the PID, the faulting address and the number
of pages (always 1 at the moment)
- the dump side extracts the requested pages from the pipe and splices them
into the TCP socket.
- the lazy-pages daemon copies the received pages into the restored process
address space
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
When appropriate, the lazy pages will not be written to the destination.
Instead, a pagemap entry for a range of such pages will be marked with the
'lazy' flag.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The pages that are mapped to zero_page_pfn are not dumped, but information
about where they are located is required for lazy restore.
Note that get_pagemap users presumed that zero pages are not a part of the
pagemap and these pages were just silently skipped during memory restore.
At the moment I preserve these semantics and force get_pagemap to skip zero
pages.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Pagemap is now friendlier to random access, so enable use of the new APIs.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Fix CID 163485 (#2 of 2): Dereference null return value (NULL_RETURNS)
7. dereference: Dereferencing a pointer that might be null dest when
calling handle_user_fault.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
This allows splitting a ppb so that data residing at the specified address
will be immediately available.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
for buffers that contain potentially lazy pages
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>