Fix for commit 0ce8e4299506 ("kerndat: do not report errors on feature
test").
That commit hid the error messages from the feature test for the case
when /proc/*/loginuid files cannot be written because an older kernel
lacks the patch that allows unsetting the loginuid value, but it didn't
hide the error messages when CONFIG_AUDITSYSCALL is disabled and the
loginuid files don't exist at all.
Also fixed the comment for the kerndat feature test: the procfs file
may fail to open if it's missing, and that's fine (the
!CONFIG_AUDITSYSCALL case), but it can't fail due to a permission fault
on _read_ (then something is wrong, let's report a problem).
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
We request all conntracks via netlink and save the netlink messages which
describe them in an image file, then we send these netlink messages back on restore.
https://github.com/xemul/criu/issues/54
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Stas found that if we don't align a pointer,
futex and atomic operations can fail.
v2: don't hard-code the size of void *
v3: add a function to allocate memory without gaps with
a previous slice. It's used to allocate arrays.
v4: don't change rst_mem_cpos
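
The alignment rule can be sketched as follows. This is only an
illustration in Python, not the rst_mem code itself: align_up and
WORD_SIZE are hypothetical names, and the word size is taken from the
running platform rather than hard-coded (the point of v2).

```python
import ctypes

# Pointer size of the running platform, queried instead of hard-coded.
WORD_SIZE = ctypes.sizeof(ctypes.c_void_p)


def align_up(pos, align=WORD_SIZE):
    """Round pos up to the next multiple of align (a power of two)."""
    return (pos + align - 1) & ~(align - 1)
```

An allocator that needs an array to follow the previous slice without a
gap would simply skip this rounding for that allocation.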
Cc: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The crtools binary is linked with the C library and can rely on all the
services this library provides, including system calls.
Thus it doesn't need to be linked with the builtin system call code
made for the parasite/restorer binaries.
This patch does:
- remove the inclusion of syscall.h
- replace all calls to sys_<syscall>() with the C library <syscall>()
- replace unwrapped system calls with syscall(SYS_<syscall>,...)
- fix the resulting compiler issues.
There should not be any functional changes. The only 'code' change is
in locks.h: when futex is called through the C library, the error code
is fetched from the errno variable instead of the return value.
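
The difference in error reporting can be seen from any program that
goes through the C library. The sketch below is an illustration only
(Python with ctypes, Linux assumed; libc_close is a hypothetical
helper): the wrapper returns -1 and leaves the code in errno, whereas a
raw system call returns the negated error code directly.

```python
import ctypes
import errno

# Handle to the C library already loaded into this process.
libc = ctypes.CDLL(None, use_errno=True)


def libc_close(fd):
    """Call close(2) through the C library; return (rc, errno value)."""
    ctypes.set_errno(0)
    rc = libc.close(fd)
    return rc, ctypes.get_errno()


rc, err = libc_close(-1)
print(rc, errno.errorcode.get(err))
```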
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The CRIU internal define of CLONE_SUBNS should not be put in
syscall-types.h since this define is not part of a system call.
This move is required to prepare for the removal of syscall.h from the
components of the crtools binary.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
The file include/prctl.h should define the struct prctl_mm_map only if
it is not already defined in the system include file linux/prctl.h.
The definition should be part of the '#ifndef PR_SET_MM_MAP' block
since this structure is not defined in that case.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
It was missed in the first place and may cause
problems when exiting via alarm handling.
Reported-by: Igor Sukhih <igor@virtuozzo.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
AutoFS will need to create the file descriptor for the write end of a pipe
if it was closed. Thus, the pipe_info structure has to be exported.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
There's only one user of it, so better to reshuffle the arg set.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Introduce post-setup-namespaces action script
It is needed to make it possible to run a custom script after the mount
namespace is configured
Signed-off-by: Igor Sukhih <igor@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Recent kernels allow a regular user to read the proc pagemap file, but
report zero PFNs in it. Support this mode for user dumps.
https://github.com/xemul/criu/issues/101
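
For reference, a pagemap entry can be decoded as sketched below. This is
an illustration in Python, not CRIU code; the bit layout follows the
kernel's pagemap documentation, and decode_pagemap_entry is a
hypothetical helper that treats a zero PFN on a present page as hidden.

```python
import struct

PM_PFN_MASK = (1 << 55) - 1  # bits 0-54: page frame number (if present)
PM_SOFT_DIRTY = 1 << 55      # bit 55: soft-dirty
PM_SWAP = 1 << 62            # bit 62: page is swapped
PM_PRESENT = 1 << 63         # bit 63: page is present in RAM


def decode_pagemap_entry(raw):
    """Decode one 8-byte native-endian entry read from a pagemap file."""
    (entry,) = struct.unpack("=Q", raw)
    present = bool(entry & PM_PRESENT)
    pfn = entry & PM_PFN_MASK if present else 0
    return {
        "present": present,
        "swapped": bool(entry & PM_SWAP),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
        # A zero PFN means the kernel did not reveal it to this reader.
        "pfn": pfn or None,
    }
```

The flag bits stay usable for an unprivileged reader; only the PFN
field is lost.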
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Replace stack alignment magic constant with
__stack_aligned__ macro.
Also align stack for sigaltstack test case.
Signed-off-by: Vijaya Kumar K <vijayak@caviumnetworks.com>
Reviewed-by: Christopher Covington <cov@codeaurora.org>
Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
arm64 requires stack to be aligned to 16 bytes.
Update the RESTORE_ALIGN_STACK macro to always align
to 16 bytes.
Signed-off-by: Vijaya Kumar K <vijayak@caviumnetworks.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
prepare_loginuid() is called from kerndat_loginuid, where it tests for
the loginuid restore feature. Let's omit error printing for the feature test.
Signed-off-by: Dmitry Safonov <dsafonov@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently, if criu segfaults, the inventory image isn't removed and
we can't detect that the images are incomplete.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now we can use the --inherit-fd option to mark external terminals on dump
and to tell which file descriptors should be used to restore these terminals.
Here is an example of how it works:
$ setsid sleep 1000
$ ipython
In [1]: import os
In [2]: st = os.stat("/proc/self/fd/0")
In [3]: print("tty[%x:%x]" % (st.st_rdev, st.st_dev))
tty[8800:d]
$ ps -C sleep
PID TTY TIME CMD
4109 ? 00:00:00 sleep
$ ./criu dump --external 'tty[8800:d]' -D imgs -v4 -t 4109
$ ./criu restore --inherit-fd 'fd[1]:tty[8800:d]' -D imgs -v4
v2: add missed break
remove @non_file from tty_driver
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This option is used to mark external resources on dump.
Currently it's going to be used to handle external tty-s,
but in the future it can be used for any type of resource.
We can have a few ways to restore external resources and
we will have separate options to say how to restore each type.
For example, we can use --inherit-fd to restore external
file descriptors.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We can't use only the terminal device number, because in that case we
cannot distinguish two pty-s from different mounts.
$ mount -t devpts -o newinstance xxx pts1
$ mount -t devpts -o newinstance xxx pts2
$ stat pts1/0
Device: 27h/39d Inode: 3 Links: 1 Device type: 88,0
$ stat pts2/0
Device: 28h/40d Inode: 3 Links: 1 Device type: 88,0
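
A key that combines both numbers can be built as sketched below
(Python, illustration only; tty_ext_id and tty_ext_id_of are
hypothetical helpers, and the "tty[rdev:dev]" form is the one used with
the --external option):

```python
import os


def tty_ext_id(rdev, dev):
    """Format the external tty key from the device type and the device."""
    return "tty[%x:%x]" % (rdev, dev)


def tty_ext_id_of(path):
    """Build the key for an existing terminal, e.g. /proc/self/fd/0."""
    st = os.stat(path)
    return tty_ext_id(st.st_rdev, st.st_dev)
```

With the two mounts above, pts1/0 and pts2/0 share the device type
(88,0) but differ in the device (27h and 28h), so their keys differ.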
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
timer_t is (void *) in glibc, but timer_t is (int) in the kernel.
When we call system calls, we need to use timer_t from the kernel.
https://github.com/xemul/criu/issues/98
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This value will differ on C/R:
- on checkpoint it means that it's possible to dump loginuid values;
- on restore it means that it's possible to unset loginuid and write
the saved value to the unset loginuid.
Signed-off-by: Dmitry Safonov <dsafonov@odin.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We use the page frame number to detect a vDSO which has been remapped
in-place from the runtime vDSO during restore. In such a case, if the
kernel is older than 3.16, the "[vdso]" mark won't be reported
in procfs output.
Still, to address recently reported CVEs and to be able to run CRIU
in unprivileged mode, we need to handle vDSO without pagemap access.
Here is the deal: when we find a VMA which "looks like" vDSO,
we scan it for vDSO symbols and, if they match, we restore
its status without PFN access.
Here are some details on the in-kernel history of @pagemap access:
- @pagemap was introduced in commit 85863e475e59, where anyone
who can attach to a task via ptrace is allowed to read
data from @pagemap (Feb 4 2008, v2.6.25-rc1)
- in commit 006ebb40d3d65 the ptrace attach rule was changed
into ptrace read permission (May 19 2008, v2.6.27-rc1)
- in commit ab676b7d6fbf4 opening of @pagemap became guarded
with CAP_SYS_ADMIN because of a leak of physical addresses
into userspace (Mar 9 2015, v4.0-rc5)
- in commit 1c90308e7a77a opening of @pagemap became available
to regular users again (with ptrace read permission), but
physical addresses of pages are hidden from non-privileged
users (Sep 8 2015, v4.3-rc1)
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When run by a regular user, criu will get EACCES/EPERM from
opening proc, but in some situations criu will know how to
deal with it. So this patch makes it possible not to print
an error message in the logs for such cases.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
We no longer support root-mode service and suid binaries, so
any artificial restrictions no longer make sense.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Looks-good-to-me: Andrew Vagin <avagin@virtuozzo.com>
This, like restore, requires several steps to reach per-thread
support at the dump stage
- the @creds area to be fetched from the parasite is embedded into
parasite_dump_structure
- when testing whether a task is dumpable we no longer compare caps
because we now allow them to be different (and I renamed
proc_status_creds_eq to proc_status_creds_dumpable for this
sake)
- dump_thread_common has to be extended to support dumping of
creds (we call dump_thread_common in several places; in
particular, when we only need to fetch misc params we don't
need creds, and here the @creds option comes into play)
- after this patch no creds-X.img file is generated anymore,
I guess we might drop it from the descriptors with time
https://jira.sw.ru/browse/PSBM-41416
v2:
- In dump_task_creds() don't mangle the call for parasite_dump_creds
and collect_lsm_profile
- PARASITE_MAX_GROUPS takes parasite_dump_thread into account because
dump_thread_common now serves two cases: for plain misc parameters
fetching and for creds as well (depending on the context)
- when testing for dumpable we still require the seccomp filters
to match; they can be different and we need to support such a
configuration too, but not in this series
v3:
- Rip off dump_task_creds completely, together with PARASITE_CMD_DUMP_CREDS,
we dump creds unconditionally in dump_thread_common
- the group leader thread data is fetched via new
parasite_dump_thread_leader_seized helper
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Because the creds parameters are to be passed inside pie/restorer
code but are read before the thread_restore_args and task_restore_args
structures are allocated, we need a small trick and prepare
creds in several stages
- collect all creds data into separate private memory blobs
- once all memory needed for the restorer is allocated, we relocate
pointers in these blocks and set up
thread_restore_args::thread_creds_args to the appropriate
address
- the restorer works as usual and sets up creds parameters as before
v2:
- fix addressing in positioning of rst_ memory (I accidentally
zapped pointers and forgot to merge the changes back when sending
the patches, so while my local series successfully restored
containers with different creds, the series as sent would not have
worked. All changes are now merged as appropriate)
- drop module's global @cap_last_cap from pie/restorer.c
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For easier comparison, which is going to be addressed in the next patch.
https://jira.sw.ru/PSBM-41416
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Similar to devtmpfs and devpts, skip binfmt_misc
mount if it's not virtual.
Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Acked-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently criu dump may hang indefinitely, e.g. while waiting for a task
that is blocked in vfork(), or for a task that is in D state for some
other reason. This patch adds a time limit on collecting tasks during
the dump operation. If collecting processes takes too long, the dump
process will be terminated. The timeout is 5 seconds by default, but
it can be changed via a parameter.
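
One common way to bound such a step is an alarm-based timer. The sketch
below is a Python illustration under that assumption, not the code of
this patch; run_with_timeout and CollectTimeout are hypothetical names.

```python
import signal
import time

DEFAULT_TIMEOUT = 5.0  # seconds, matching the default described above


class CollectTimeout(Exception):
    """Raised when the guarded step does not finish in time."""


def run_with_timeout(collect, timeout=DEFAULT_TIMEOUT):
    """Run collect() and abort it with CollectTimeout after timeout seconds."""
    def on_alarm(signum, frame):
        raise CollectTimeout("collecting tasks took too long")

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return collect()
    finally:
        # Disarm the timer and restore the previous handler in all cases.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```

The timer is disarmed on every exit path so that a late SIGALRM cannot
hit code that runs after the guarded step.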
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch brings add_to_string() and construct_string() helpers.
They allow creating a string from a variable number of parameters in sprintf()
manner, with string allocation (and reallocation if necessary) done by the helper.
v2:
1) Helpers were renamed to xstrcat() and xsprintf() respectively.
2) Added printf attributes to force compiler checks
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The patch restores the freezer cgroup state between the finalize_restore stages.
It should be done after the first stage because we cannot unmap the restorer blob
from a frozen process, and before the second stage because we must freeze processes
before they continue running.
We also need to move fini_cgroup between these stages to give the freezer
cgroup state restorer access to the cgroup mount directories.
The error handlers contain fini_cgroup, so we are sure that the fini_cgroup call
won't be missed.
The patch restores the state only for the one freezer cgroup from the --freeze-cgroup
option, not all states from the whole hierarchy, because CRIU supports checkpoint from
a freezer cgroup hierarchy only with the THAWED state, except for the root cgroup from
the --freeze-cgroup option.
Signed-off-by: Evgeniy Akimov <geka666@gmail.com>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CRIU sets freezer.state to "THAWED" during process tree dumping. That's why
we can't simply save the freezer.state file contents to the cgroups image. The
new function get_real_freezer_state() returns the freezer cgroup state
observed before CRIU started dumping. The patch puts its return value into the dump file.
Signed-off-by: Evgeniy Akimov <geka666@gmail.com>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This will be required for page-cache and page-proxy set.
Signed-off-by: Rodrigo Bruno <rbruno at gsd.inesc-id.pt>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A freezer cgroup can contain tasks which will not be dumped.
criu unfreezes the group, so we need to freeze all extra
tasks with ptrace like we do for target tasks.
Currently we attach and send interrupt signals to these tasks,
but we don't call waitpid() for them, so then waitpid(-1, ...)
returns these tasks where we don't expect to see them.
v2: execute freezer_detach() only if opts.freeze_cgroup is set
calculate extra tasks in a freezer cgroup correctly
v3: s/frozen_processes/processes_to_wait/
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used to mount AutoFS, because context creation is required in
addition to actual mount operation.
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch introduces three helpers:
1) pstree_item_by_real() - search for pstree item by real pid.
2) pstree_item_by_virt() - search for pstree item by virtual pid.
3) pid_to_virt() - return the virtual pid for a real one.
Note: pstree_item_by_virt() and pid_to_virt() will be used to migrate AutoFS.
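
The semantics of the three helpers can be sketched as follows. This is
a Python illustration only: the real helpers walk CRIU's process tree,
while here the tree is a plain list passed as a hypothetical items
argument, and returning None for an unknown pid is an assumption of
this sketch.

```python
from dataclasses import dataclass


@dataclass
class PstreeItem:
    real: int  # pid as seen from the host
    virt: int  # pid as seen inside the dumped pid namespace


def pstree_item_by_real(items, real):
    """Find the item whose real pid matches, or None."""
    return next((item for item in items if item.real == real), None)


def pstree_item_by_virt(items, virt):
    """Find the item whose virtual pid matches, or None."""
    return next((item for item in items if item.virt == virt), None)


def pid_to_virt(items, real):
    """Translate a real pid into the virtual one, or None if unknown."""
    item = pstree_item_by_real(items, real)
    return item.virt if item else None
```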
Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently we wait until a namespace is restored to get its root.
We need to open a namespace root to open a file to restore a memory mapping.
A process restores mappings and only then forks children. So we can have
a situation, when we need to open a file from a namespace, which will be
"restored" by one of our children.
The root task restores all mount namespaces and opens a file descriptor
for each of them. In this patch we open root for each mntns in the root
task.
If we need to get the root of a namespace which isn't populated, we can get
it from the root task. After the CR_STATE_FORKING stage, the root task
closes all namespace descriptors and we know that all namespaces are
populated at this moment.
v2: don't close root_fd for root ns, because it was not opened
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We need to perform dirty page tracking when dumping shmem, but there
we have only const vmas, so we need pmc to work with them. Also, the pmc concept
implies that it won't change its vmas, so it is natural to declare
them as const.
Signed-off-by: Fyodor Bocharov <fbocharov@yandex.ru>
Signed-off-by: Eugene Batalov <eabatalov89@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
In LXD, we use the container name in the LSM profile. If the container name
is changed on migration (on the host side), we want to use a different LSM
profile name (a la --cgroup-root). This flag adds that support.
v2: remove unused field, add comment about double detection in
kerndat_lsm()
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
When we're restoring fsnotify watchees we need to resolve
a path to a handle at some mountpoint referred to by the @s_dev
member (device ID) which is saved inside the image. This
ID may actually change on every mount (say,
one restores a container after a machine reboot) or in
case of the container's migration.
Thus the test for overmounting in __open_mountpoint
will fail and we get an error.
Let's do a trick: introduce the @s_dev_rt member which
is supposed to carry the run-time device ID. When dumping,
this member is simply equal to the traditional @s_dev fetched
from procfs, but when restoring we fetch it from a
stat call once the mountpoint becomes alive.
https://jira.sw.ru/browse/PSBM-41610
v2:
- predefine MOUNT_INVALID_DEV
- use fetch_rt_stat instead of assigning device in restore_shared_options
- copy @s_dev_rt in propagate_siblings and propagate_mount
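
The idea can be sketched as follows. This is a Python illustration, not
the mount code itself: the MountInfo class, the signature of
fetch_rt_stat and the value of MOUNT_INVALID_DEV are assumptions made
for this sketch.

```python
import os
from dataclasses import dataclass

MOUNT_INVALID_DEV = 0  # placeholder value, assumed for this sketch


@dataclass
class MountInfo:
    mountpoint: str
    s_dev: int                         # device ID saved in the image
    s_dev_rt: int = MOUNT_INVALID_DEV  # device ID observed at run time


def fetch_rt_stat(mi):
    """Refresh the run-time device ID once the mountpoint is alive."""
    mi.s_dev_rt = os.stat(mi.mountpoint).st_dev
    return mi.s_dev_rt
```

Checks done on restore would then compare against s_dev_rt, while s_dev
keeps the value needed to match records saved in the image.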
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This patch implements checkpoint/restore functionality
for binfmt_misc mounts. Both magic and extension types
and "disabled" state are supported.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For security reasons the systemd-spawn mode is no longer
supported in service.
Also change the default binding address to be in the local cwd so that
a global service is not started by accident.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We found that we need to know whether SIGSTOP is queued
in both queues or only in one of them.
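
Assuming the two queues are the per-thread and the shared pending
queues, which /proc/<pid>/status exposes as the SigPnd and ShdPnd masks,
the check can be sketched like this (Python, illustration only;
sigstop_pending is a hypothetical helper):

```python
import signal

# Signal N is represented by bit N-1 of the pending masks.
SIGSTOP_BIT = 1 << (signal.SIGSTOP - 1)


def sigstop_pending(status_text):
    """Return (in_private_queue, in_shared_queue) for SIGSTOP."""
    masks = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("SigPnd", "ShdPnd"):
            masks[key] = int(value.strip(), 16)
    return (bool(masks.get("SigPnd", 0) & SIGSTOP_BIT),
            bool(masks.get("ShdPnd", 0) & SIGSTOP_BIT))
```

It takes the text of a status file, e.g. the contents of
/proc/self/status.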
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>