Move the function and reduce its arguments number.
This is cleanup needed to keep all tty code together.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Unchanged test provided by Andrew.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In case we have mounts:
1 /mnt/
2 /mnt/a with parent 1
3 /mnt/a/b with parent 1
4 /mnt/a with parent 2
We determine 2 as needing remap with does_mnt_overmount() and remap it.
Next we mount 4 on top of 2. Next in fixup_remap_mounts() we want to
move 2 back to it's parent 1, but instead move 4 there. So in these case
children-overmounts need to be remapped too.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Remaps in mnt_remap_list should follow same descending order which was
setup in mnt_resort_siblings(), so don't reorder them.
For instance if we have sibling mounts with mountpoints:
1) /dir1/dir2/dir3
2) /dir1/dir2
3) /dir1
Here (2) is sibling-overmount for (1). Mount (3) is sibling-overmount
for both (1) and (2). So when we move overmounts back in
fixup_remap_mounts() we should first move (2) and only then (3).
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
We should add new entry _before_ first entry with less depth to sort in
descending order.
e.g: entries in list have depths [7,5,3], adding new entry m with depth
4 we would break list_for_each_entry loop on p with depth 3, before
patch we would get [7,5,3,4] after list_add, which is wrong.
Also we can relax "<=" check to "<" to avoid unnecessary reordering.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
dump of VZ7 ct fails, if we have overmounted tmpfs inside:
[root@silo ~]# prlctl enter su-test-2
entered into CT
CT-829e7b28 /# mkdir /mnt/overmntedtmp
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt/overmntedtmp/
CT-829e7b28 /# mount -t tmpfs tmpfs /mnt
CT-829e7b28 /# logout
[root@silo ~]# prlctl suspend su-test-2
Suspending the CT...
Failed to suspend the CT: PRL_ERR_VZCTL_OPERATION_FAILED (Details: Will skip in-flight TCP connections
(01.657913) Error (criu/mount.c:1202): mnt: Can't open ./mnt/overmntedtmp: No such file or directory
(01.662528) Error (criu/util.c:709): exited, status=1
(01.664329) Error (criu/util.c:709): exited, status=1
(01.664694) Error (criu/cr-dump.c:2005): Dumping FAILED.
Failed to checkpoint the Container
All dump files and logs were saved to /vz/private/829e7b28-f204-4bce-b09f-d203b99befd4/dump/Dump.fail
Checkpointing failed
)
Criu wants to dump the contents of /mnt/overmntedtmp/ mount but it is
unavailable. So we copy the mount namespace in such a case and unmount
overmounts to access what we want to dump.
Actual usecase here is dumping CT with active mariadb and ssh
connection. Together they happen to create such overmount. As by default
systemd creates a separate mount namespace for mysql and also mounts
tmpfs to /run/user in it, and when ssh(root) is connected - systemd also
mounts tmpfs in container root mount namespace to /run/user/0 for user
files. As /run is slave mount /run/user/0 also propagates to mysql's
mount namespace and initially becomes overmounted by /run/user.
https://jira.sw.ru/browse/PSBM-57362
remove __maybe_unused for mnt_is_overmounted and umount_overmounts
changes in v2:
1) Use clone not fork, share resources with parent same as in
call_in_child_process.
2) Do not enter userns (create helper) for non-overmounted mounts. Thus
return back setns/resorens logic.
3) Helper opens fd for parent directly due to CLONE_FILES, remove futex.
4) Check helper exit status properly.
5) Add get_clean_fd helper.
6) Add better comments.
changes in v3:
1) Pass fd from helper through args instead of ret code, fix ret code
checking.
2) Add \n to pr_err in open_mountpoint
changes in v5:
Make comments even better.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
also remove __maybe_unused for __umount_children_overmounts
note: leave it __maybe_unused yet
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
It has two arguments "pos_l and "pos_h" instead of one "off". It is used
to handle 64-bit offsets on 32-bit kernels.
SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
https://github.com/checkpoint-restore/criu/issues/424
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Due to way CRIU handles paths (as relative to workdir), there's a case,
where migration would fail. Simple example is a ghost file in filesystem
root (with root being cwd). For example, "/unlinked" becomes "unlinked".
And original code piece scans path for other slashes, which would be
missing in this case. But it's still a perfectly valid case, and there's
no need to fail. So if there's no parent dir - we just don't need to
create one and we can just return 0 here instead of failing.
Signed-off-by: Vitaly Ostrosablin <vostrosablin@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The kernel doesn't have an interface to get a sent queue for udp
sockets, so currently we can't dump them and criu dump has to fail in
such cases.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Now we block all sockets with non-zero idiag_wqueue, but it doesn't mean
that a CORK option is enabled for a socket. A packet can be in a network
stack and it is accounted into idiag_wqueue.
https://github.com/checkpoint-restore/criu/issues/409
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Now it's probably one valide use case, because there is no way to commit
a container when a container is being checkpointed.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
All static tests has to stop any activity before C/R.
./tempfs_subns --pidfile=tempfs_subns.pid --outfile=tempfs_subns.out --dirname=tempfs_subns.test
Run criu dump
Unable to kill 128: [Errno 3] No such process
Run criu restore
7: Old mounts lost: []
7: New mounts appeared: [('/rootfs/criu/test', '/'), ('/', '/proc'), ('/', '/dev/pts')]
:
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Except for several false positives done by:
find -type f -name "*.c" -not -path "./test/*" -exec sed -i
's/\(\<pr_err.*[^\][^n]\)\("[,)]\)/\1\\n\2/g' {} \;
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently we use an additional pipe to steal data from a pipe, but we
don't check that we steal all data. And the additional pipe can have a
smaller size.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
a) As we shmalloced rpath it can not be xfreed. b) As we shmalloc
variables in collect_remap_ghost() far away from open_remap_ghost()
where we want to free them, there is no guaranty that our shmalloc was
last and we can't use shfree_last().
fixes commit 0c675a5e9d40 ("files: remove link_remaps when everything
has been restored")
When create_ghost() fails for some reason that produces a segfault for me.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reading and writing large buffers may result in short read/write. In cases
we expect the entire buffer to be transferred use {read,write}_data rather
than plain read/write syscalls.
Reported-by: Mr Jenkins
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Some tests expect that all the data will be handled in a single invocation
of read/write system call. But it possible to have short read/write on a
loaded system and this is not an error.
Add helper functions that will reliably read/write the entire buffer.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently we use the "map_files/%p-%p" format, but actually it should
be "map_files/%lx-%lx".
The kernel could handle both formats, but recently Alexey Dobriyan fixed
the kernel and it accept only the second format.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If we start backroung memory fetch before restore is completely finished,
we may try to write to the memory areas which were not yet remapped to
proper place and are not registered with userfaultfd.
Add synchronization between restore and the lazy-pages so that lazy-pages
will only handle #PFs before all the tasks are restored.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The remote page read has nothing to do if the page-server on the source has
closed the connection. Just report an error and abort.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Currently when we poll a file descriptor, we only process EPOLLIN events
and if a connection is closed the receiving side has no means to deal with
it.
Add a callback for EPOLL{RD}HUP events and the default implementation for
these events that removes the file descriptor from epoll and closes it.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A bit more readable and will be easy to distinguish from upcoming
hevent^Whangup_event.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
The generic epoll_wait wrapper should not do any assumptions about timeout.
It's it up to lazy-pages daemon to make (future) policy decisions.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If ipv6 socket has an IPv4-mapped address, it is used to handle ipv4
connection, so we have to use ipv4 iptables rules to block this
connection.
Reported-by: Mr Jenkins
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In addition to writing the CRIU version to the log file this adds the
current kernel version to the log file:
(00.000008) Version: 3.5 (gitid v3.5-511-ga8cc6cf)
(00.000303) Running on node01 Linux 3.10.0-513.el7.x86_64 #1 SMP Tue Feb 29 06:78:90 EST 2017 x86_64
v2:
- small changes as suggested by Dmitry (thanks)
Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Leases can be set only on regular files. Thus, as optimization we can
skip attempts to find associated leases in 'correct_file_leases_type'
for other fd types.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
-- check childs' errors in file_leases03
-- test c/r of lease transfered to child process
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
CRIU creates dictinct lock record for each file descriptor on the same
OFD. The patch removes this duplicates. To do so, it adds new field into
struct file_lock, which stores pid of fd, on which lock was found.
'owner pid' is not actually helpful, because the original fd, on which
lock have been set, can be already closed.
Also it purges crutches doing the same stuff but only for file leases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
If we receive only part of the IOV from the page-server we recalculate the
IOV so it will point to the area we still have to fetch. During the split,
the IOV covering the remaining area may remain marked as 'queued' and we'll
never retry fetching it.
Marking the IOV as not queued will ensure its pages will be requested
again.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Since commit e609267f681062b4370e528a50f635222e0c2330 ("page-pipe: allow to
share pipes between page pipe buffers") the assumption that we will receive
the exact amount of pages we've requested with PS_IOV_GET does not always
hold.
In the case we serve pages data from the images using 'page-server
--lazy-page' the IOVs seen by the pagemap may cross page-pipe buffer
boundaries and read_page_pipe will clamp the pages in the response to those
boundaries.
Adjust page_server_read so it will not try to receive more pages than
page-server is going to send.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>