We're going to split interconnected pair restore
into two stages. Since we need the second end
to restore the message queue in the (future) post open,
we add it to the process that owns the first
end.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
This functionality allows making a fle a master at
collection time. We will use it to add fake
files when we need to do this after add_fake_fds_masters().
This will be used to add the second end of a socketpair as
a fake fle (as the first end is placed in the right
place, we will force-add the second end there).
See next patches.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Since this function is used by standalone sockets only,
move it to the appropriate place. No functional changes.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Since epoll restore is split into two parts,
epoll_create() does not depend on the state of
other files. Once an epoll is created, it
can be sent anywhere. So there are
no circular dependencies, and we allow epolls
to be sent over a unix socket.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Create a socketpair and an epoll. Add one end of the socketpair
to the epoll and then send the epoll twice over the other end.
After restore, check that the epoll can be received
via the socket and that it contains the event.
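A minimal sketch of the scenario described above, without the
checkpoint/restore step in the middle. It is not the zdtm test itself:
the helper names (send_fd(), recv_fd(), epoll_over_socketpair()) and the
choice of SOCK_DGRAM are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over the unix socket sk as SCM_RIGHTS ancillary data. */
static int send_fd(int sk, int fd)
{
	char data = 'x';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sk, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sk; returns the new fd or -1. */
static int recv_fd(int sk)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sk, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Returns the number of events reported by the received epoll, or -1. */
static int epoll_over_socketpair(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out;
	int sk[2], epfd, received, nr;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sk))
		return -1;
	epfd = epoll_create(1);
	if (epfd < 0)
		return -1;
	ev.data.fd = sk[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sk[0], &ev))
		return -1;

	/* Two messages get queued on sk[0], each carrying the epoll. */
	if (send_fd(sk[1], epfd) || send_fd(sk[1], epfd))
		return -1;

	/* One message stays queued, so sk[0] is still readable. */
	received = recv_fd(sk[0]);
	if (received < 0)
		return -1;
	nr = epoll_wait(received, &out, 1, 0);

	close(received);
	close(epfd);
	close(sk[0]);
	close(sk[1]);
	return nr;
}
```

The received descriptor refers to the same epoll instance, so it still
reports the pending EPOLLIN event on the watched end.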
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
$ make lint
flake8 --config=scripts/flake8.cfg test/zdtm.py
test/zdtm.py:323:19: F841 local variable 'e' is assigned to but never used
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
waitpid() does not return a child's pid while the child has not exited.
So, we can't use it to find the pids of children.
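A small illustration of the behavior above (a sketch, not the zdtm.py
code; poll_running_child() is a hypothetical name): with WNOHANG,
waitpid() returns 0 while the only child is still running.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns what waitpid(-1, WNOHANG) reports for a still-running child. */
static pid_t poll_running_child(void)
{
	int status;
	pid_t pid, ret;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		pause();	/* stay alive until killed */
		_exit(0);
	}

	/* The child has not exited, so no pid is reported here. */
	ret = waitpid(-1, &status, WNOHANG);

	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return ret;
}
```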
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
close_safe() can operate on an uninitialized fd if switch_ns fails.
Found by Coverity Scan:
*** CID 187164: Uninitialized variables (UNINIT)
/criu/mount.c: 1313 in open_mountpoint()
1307 err:
1308 return 1;
1309 }
1310
1311 int open_mountpoint(struct mount_info *pm)
1312 {
>>> CID 187164: Uninitialized variables (UNINIT)
>>> Declaring variable "fd" without initializer.
1313 int fd, cwd_fd, ns_old = -1;
1314
1315 /* No overmounts and children - the entire mount is visible */
1316 if (list_empty(&pm->children) && !mnt_is_overmounted(pm))
1317 return __open_mountpoint(pm, -1);
1318
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Start test
./mxcsr --pidfile=mxcsr.pid --outfile=mxcsr.out
Run criu dump
Unable to kill 44: [Errno 3] No such process <--------------- this one
Run criu restore
Run criu dump
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Write a null byte only if there is enough space for it.
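The idea can be sketched as follows. This is a generic illustration
with a hypothetical helper name, not the patched function: the
terminator is written only when the copied data leaves room for it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most size bytes of src into buf and terminate the result
 * only if one byte is left. Returns 1 if terminated, 0 otherwise.
 */
static int copy_and_terminate(char *buf, size_t size, const char *src, size_t len)
{
	size_t n = len < size ? len : size;

	memcpy(buf, src, n);
	if (n < size) {
		buf[n] = '\0';
		return 1;
	}
	return 0;	/* buffer is full: writing buf[n] would overflow it */
}
```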
Cc: Stephen Röttger <stephen.roettger@gmail.com>
Reported-by: Stephen Röttger <stephen.roettger@gmail.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
We print an error in every error case when calling linkat_hard() anyway,
but some errors, like EEXIST, are fine and we just skip them, so we should
not print an error here.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
This is a test for the convert_path_from_another_mp fix. It is a bit tricky,
as we don't fully support ghosts on a read-only fs, but only if the ghost can
be remapped on some _other_ bindmount (luckily we have the same ghost on the
other bind). Moreover, the wrong absolute path generated by the old
convert_path_from_another_mp for linkat doesn't always fail: it fails only
when we want to do linkat on a mount in the _other_ mountns, the absolute
path makes us do it in the local mountns, and the local path is read-only. =)
v2: remove unused headers
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
If dmi->ns_mountpoint is "/", then dst will contain "/...", an
absolute path, but here we want a path relative to the dmi mount. Adding "./"
before the path guarantees that it will always be relative.
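A simplified sketch of the effect. The helper below is hypothetical and
does not reproduce the real format string of convert_path_from_another_mp;
it only shows that a naive "%s/%s" join turns absolute when the mount
point is "/", while a "./" prefix keeps the result relative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join mount point and the rest of the path, always relative. */
static int build_relative_path(char *dst, size_t size,
			       const char *ns_mountpoint, const char *rest)
{
	/* Without the "./" prefix, "/" + "a/b" would give "//a/b". */
	int n = snprintf(dst, size, "./%s/%s", ns_mountpoint, rest);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Repeated slashes are harmless for path resolution, so the sketch does
not bother to squash them.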
https://jira.sw.ru/browse/PSBM-72351
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Fork tasks and create fds with different numbers.
Some children share files with the parent (CLONE_FILES).
Check that we can suspend and resume in this case.
v2: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Currently, we set rlim(RLIMIT_NOFILE) to unlimited
and use service_fd_rlim_cur to place service fds.
This leads to a significant problem: every task uses
the biggest possible files_struct in the kernel, and
it consumes excess memory after restore
compared to dump. In some situations this
may cause the restore to fail because there is not enough
memory in the memory cgroup or on the node.
The patch fixes the problem by introducing
a per-task service_fd_base. It's calculated
from the max used file fd and is placed
near the right border of the kernel-allocated memory
chunk for the task's fds (see alloc_fdtable() for
details). This reduces the kernel-allocated files_struct
to 512 fds for most processes on a standard linux
system (I've analysed the processes on my work system).
Also, since the "standard processes" will have the same
service_fd_base, clone_service_fd() won't have to
actually dup() their service fds for them like we
have at the moment. This is one of the reasons why
we still keep service fds as a range of fds,
and do not try to use unused holes in task fds.
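A rough sketch of the placement idea, under loud assumptions: the
rounding below (next power of two with a caller-supplied minimum) is a
simplification and is neither the exact alloc_fdtable() arithmetic nor
the code of this patch, and all names are hypothetical.

```c
#include <assert.h>

/* Smallest power of two that is >= n (for small n >= 1). */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Assumed size of the fd table needed for task fds plus service fds. */
static unsigned int service_fd_border(unsigned int max_fd, unsigned int nr_sfds,
				      unsigned int min_table)
{
	unsigned int table = round_up_pow2(max_fd + 1 + nr_sfds);

	return table < min_table ? min_table : table;
}

/* Service fds take the highest numbers: [base, border). */
static unsigned int service_fd_base_sketch(unsigned int max_fd, unsigned int nr_sfds,
					   unsigned int min_table)
{
	return service_fd_border(max_fd, nr_sfds, min_table) - nr_sfds;
}
```

With a minimum of 512, every task whose fds and service fds fit into 512
slots gets the same base, which is the case where no dup() is needed.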
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
v2: Add a handle for very big fd numbers near service_fd_rlim_cur.
v3: Fix excess accounting for nr equal to a power of 2 minus 1.
In normal life this is impossible. But in the case of a big
fdt::nr number (many processes sharing the same files)
and a custom service_fd_base, a normal (!CLONE_FILES) child
of such a process may have service fds that overlap with the
parent's fdt. This patch introduces "memmove()" behavior
(currently there is "memcpy()" behavior), and this will
be used in the next patch.
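The difference can be sketched on a plain array that stands in for the
fd table (an illustration with hypothetical names, not the
clone_service_fd() code): when the old and the new range overlap, the
copy direction is chosen so that no source slot is overwritten before
it has been moved.

```c
#include <assert.h>

/* Move nr slots from old_base to new_base; freed slots become -1. */
static void move_slots(int *table, int old_base, int new_base, int nr)
{
	int i;

	if (old_base == new_base)
		return;

	if (new_base > old_base) {
		/* Moving up: start from the highest slot. */
		for (i = nr - 1; i >= 0; i--) {
			table[new_base + i] = table[old_base + i];
			table[old_base + i] = -1;
		}
	} else {
		/* Moving down: start from the lowest slot. */
		for (i = 0; i < nr; i++) {
			table[new_base + i] = table[old_base + i];
			table[old_base + i] = -1;
		}
	}
}
```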
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
This patch just moves part of clone_service_fd()
to a separate function, which improves the readability of the code.
There are no functional changes, only refactoring.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
This patch moves the service fds relocation call after
root_prepare_shared()->prepare_fd_pid(). Next patches
will make service_fd_base depend on the task's max used fd,
and for root_item we need to read all fles to know
the maximum of them.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Next patches will make service_fd_base not constant.
It will be "floating" and change from task to task.
This patch prepares for that: it closes the
old service fd after it's duplicated.
Currently the code is unused, as in the case of
!(rsti(me)->clone_flags & CLONE_FILES) the child
has the same id as its parent, and the duplication
just does not occur.
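In its simplest form, "duplicate, then close the old one" looks like
the sketch below. move_fd() is a hypothetical helper, not a criu
function.

```c
#include <assert.h>
#include <unistd.h>

/* Make new_fd refer to the file behind old_fd and drop old_fd. */
static int move_fd(int old_fd, int new_fd)
{
	if (old_fd == new_fd)
		return new_fd;	/* nothing to duplicate, nothing to close */

	if (dup2(old_fd, new_fd) < 0)
		return -1;
	close(old_fd);
	return new_fd;
}
```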
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Next patches will make service fd numbers not connected
to rlimit. Change the name to better fit its goal.
Also, keep the service_fd_rlim_cur variable to have cached
access to the rlimit value.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
v2: More comments
This patch populates/occupies the PROC_FD_OFF fd number,
which is going to be replaced atomically in next patches.
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
This patch populates and occupies ROOT_FD_OFF fd,
which guarantees it won't be reused by ordinary fds.
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
This patch introduces sfds_protected, which allows
masking areas where modifications of sfds are prohibited.
That guarantees that populated sfds won't be reused.
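A minimal sketch of such a guard. Only the sfds_protected name comes
from this message; the enum, the map, and install_sfd() are
hypothetical, and the sketch returns an error where the real code is
said to hit BUG().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical service fd types, for illustration only. */
enum sfd_type { SFD_A, SFD_B, SFD_MAX };

static bool sfds_protected;
static int sfd_map[SFD_MAX] = { -1, -1 };

/* Refuse to touch the map while the protected area is active. */
static int install_sfd(enum sfd_type type, int fd)
{
	if (sfds_protected)
		return -1;	/* the patch prints the sfd type and calls BUG() */

	sfd_map[type] = fd;
	return 0;
}
```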
v4: New
v5: Add comment and print sfd type before BUG().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Function to print the call trace of a process.
Borrowed from the glibc manual:
https://www.gnu.org/software/libc/manual/html_node/Backtraces.html
backtrace() and backtrace_symbols() are not implemented in alpine,
so we use an #ifdef __GLIBC__ to avoid compiling this function there.
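A sketch in the spirit of the manual's example. It is not the function
added by this patch: the name, the frame limit, and the non-glibc
fallback are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#ifdef __GLIBC__
#include <execinfo.h>
#include <stdlib.h>
#endif

#define BT_MAX_FRAMES 10

/* Print the current call trace to stderr; returns the frame count. */
static int print_stack_trace(void)
{
#ifdef __GLIBC__
	void *frames[BT_MAX_FRAMES];
	char **symbols;
	int i, nr;

	nr = backtrace(frames, BT_MAX_FRAMES);
	symbols = backtrace_symbols(frames, nr);
	if (!symbols)
		return -1;

	for (i = 0; i < nr; i++)
		fprintf(stderr, "%s\n", symbols[i]);

	free(symbols);	/* only the array itself has to be freed */
	return nr;
#else
	return 0;	/* no backtrace() outside glibc, e.g. on alpine */
#endif
}
```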
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
1) Later, mntns_set_root_fd() calls install_service_fd(),
which silently closes an already open fd. So, kill close_service_fd()
and make the ROOT_FD_OFF modifications in __mntns_get_root_fd() atomic.
2) close_pid_proc() is not needed here, as it's about root_item's
/proc directory and __mntns_get_root_fd() actions don't act on it.
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Here we need to close the proc self fd only, as it's not
a service fd, and it can occupy a real task fd number.
Closing PROC_PID_FD_OFF is a useless action here,
because it already occupies a service fd number.
So, we skip this excess syscall and leave PROC_PID_FD_OFF
open.
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Next patch will move SELF_STDIN_OFF sfd to fdstore.
This patch moves fdstore_init() before tty_prep_fds().
v4: New
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Create a zombie with specific pgid and check that
pgid remains the same after restore.
This test hangs criu restore without either of the two previous patches:
1) without "restore: Call prepare_fds() in restore_one_zombie()"
in 100% of cases;
2) without "restore: Split restore_one_helper() and wait exiting
zombie children" the failure is racy, but you can add something like
criu/cr-restore.c:
@@ -1130,6 +1130,8 @@ static int restore_one_zombie(CoreEntry *core)
if (task_entries != NULL) {
restore_finish_stage(task_entries, CR_STATE_RESTORE);
+ if (current->parent->pid->state == TASK_ALIVE)
+ sleep(2);
zombie_prepare_signals();
}
and it will fail with almost 100% probability.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A zombie can also be chosen as a parent for a task helper, like
any other task.
If the task helper exits between restore_finish_stage(CR_STATE_RESTORE)
and zombie_prepare_signals()->SIG_UNBLOCK, the standard criu SIGCHLD
handler is called, and the restore fails:
(00.057762) 41: Error (criu/cr-restore.c:1557): 40 exited, status=0
(00.057815) Error (criu/cr-restore.c:2465): Restoring FAILED.
This patch makes restore_one_zombie() behave like restore_one_helper()
and wait for its children to exit before allowing SIGCHLD. This makes us
safe against races with exiting children.
See next patch for test details.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
A zombie may be chosen as a parent for a task helper
while solving pgid dependencies. In this situation,
it starts to share the fdt with the helper and it has
to call prepare_fds() to decrement fdt->nr.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
There are two problems. The first is that CTL_TTY_OFF occupies
one of the biggest available fds in the system. It's a number
near service_fd_rlim_cur. Next patches want to allocate
service fds lower than service_fd_rlim_cur, and they want
to know the max used fd from file fles after the image is read.
But since one of the fds is already set very big (CTL_TTY_OFF)
at the fle collection stage, the only available service
fds are near service_fd_rlim_cur. It's a vicious circle,
and the only way out is to change how the ctl tty fd is allocated.
The second problem is that ctl tty is an ugly fixup outside the generic
file engine (see open_fd()). This is done because ctl tty
is the only slave fle which needs additional actions
(see tty_restore_ctl_terminal()). Other file types just
receive their slave fle and do not do anything else.
This patch moves ctl tty to the generic engine and solves both of
the above problems. To do that, we implement a new CTL_TTY
file type, whose open method waits until the slave tty is received
and then calls tty_restore_ctl_terminal() for it. It fits
the generic engine well, allocates its fd via find_unused_fd(),
and does not pollute the file table with big fd numbers.
The next patch will kill the currently unneeded CTL_TTY leftovers
and will remove the CTL_TTY_OFF service fd from criu.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>