The amdgpu plugin would create a memory buffer at the size
of the largest VRAM bo (buffer object). On some systems, VRAM
size exceeds RAM size, so the largest bo might be larger than
the available memory.
Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the
size of this buffer. By default, it is set to 0, and has no
effect. When active, any bo larger than its value will be
saved to/restored from file in multiple passes.
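
The multi-pass behaviour can be sketched as follows. This is only an
illustration in Python, not the plugin's code: the function name, the
stream objects and the assumption that the cap is given in bytes are all
hypothetical.

```python
import io


def copy_in_passes(src, dst, total_size, max_buffer_size=0):
    """Copy total_size bytes from src to dst using a bounded buffer.

    max_buffer_size == 0 means "no cap": everything moves in one pass.
    Returns the number of passes that were needed.
    """
    # Assumption: the cap is expressed in bytes.
    if max_buffer_size == 0:
        chunk = total_size
    else:
        chunk = min(total_size, max_buffer_size)
    passes = 0
    remaining = total_size
    while remaining > 0:
        data = src.read(min(chunk, remaining))
        if not data:
            raise EOFError("source ended before total_size bytes were read")
        dst.write(data)
        remaining -= len(data)
        passes += 1
    return passes


# A 10240-byte stand-in for one bo, copied with a 4096-byte cap.
payload = bytes(range(256)) * 40
capped_out = io.BytesIO()
capped_passes = copy_in_passes(io.BytesIO(payload), capped_out, len(payload), 4096)
```

With the cap left at 0 the same call falls back to a single pass, which
mirrors the "no effect by default" behaviour described above.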
Signed-off-by: David Francis <David.Francis@amd.com>
This commit removes the checks for the Python 2 binary in the makefile
and makes sure that ZDTM tests always use python3. Since support for
Python 2 has been dropped, these checks are no longer needed.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
A new 'query-ext-files' action for `criu dump` is sent after
freezing the process tree. This allows deferring the gathering
of the external file list until the process tree is in a stable
state, and avoids a race with the process creating and deleting
files.
Change-Id: Iae32149dc3992dea086f513ada52cf6863beaa1f
Signed-off-by: Michał Mirosław <emmir@google.com>
Make it possible to skip the network lock, so that use cases that break
connections anyway can work without iptables/nftables being present.
Signed-off-by: Michał Mirosław <emmir@google.com>
By default, the file name 'amdgpu_plugin.txt' is used also as the name
for the corresponding man page (`man amdgpu_plugin`). However, when
this man page is installed system-wide it would be more appropriate
to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`).
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --ghost-fiemap option was introduced with #1963.
It enables an optimized algorithm based on fiemap ioctl that can reduce
the number of syscalls used to checkpoint highly sparse ghost files. This
option is enabled by default. It can be disabled with --no-ghost-fiemap
when using SEEK_HOLE/SEEK_DATA is preferred. In addition, an automatic
fallback to SEEK_HOLE/SEEK_DATA is used for filesystems that do not
support fiemap.
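
For reference, the SEEK_HOLE/SEEK_DATA fallback can be pictured with the
sketch below. It is written in Python rather than C and is not CRIU's
implementation; data_extents() is a hypothetical helper. It also shows
why highly sparse files are costly on this path: every extent needs two
lseek calls.

```python
import errno
import os
import tempfile

MIB = 1024 * 1024


def data_extents(fd, size):
    """Return [start, end) ranges that hold data, skipping holes."""
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as err:
            if err.errno == errno.ENXIO:  # no data at or after offset
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents


# A 2 MiB file with a single byte of data in the middle.
with tempfile.TemporaryFile() as sparse:
    sparse.truncate(2 * MIB)
    sparse.seek(MIB)
    sparse.write(b"x")
    sparse.flush()
    extents = data_extents(sparse.fileno(), 2 * MIB)
```

On filesystems without hole support the whole file is reported as one
data extent, so the result is still correct, only less compact.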
Co-authored-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This adds the non-root section and information about the parameter
--unprivileged to the man page.
Co-authored-by: Anna Singleton <annabeths111@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Anna Singleton <annabeths111@gmail.com>
A file's r/w/x changing between checkpoint and restore does
not necessarily imply that something is wrong. For example,
if a process opens a file having perms rw- for reading and
we change the perms to r--, the process can be restored and
will function as expected.
Therefore, this patch adds an option
--skip-file-rwx-check
to disable this check on restore. File validation is unaffected
and should still function as expected with respect to the content
of files.
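
The reasoning about permissions can be checked outside of CRIU with a
small Python sketch. It only demonstrates that an already opened
descriptor keeps working after the mode changes; the helper name and the
file name are made up for the example.

```python
import os
import stat
import tempfile


def read_after_permission_drop(path):
    """Open path for reading, drop its write bit, then read via the old fd."""
    with open(path, "rb") as handle:
        os.chmod(path, stat.S_IRUSR)  # rw- becomes r-- while the file is open
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    example = os.path.join(workdir, "data.txt")
    with open(example, "wb") as f:
        f.write(b"payload")
    os.chmod(example, stat.S_IRUSR | stat.S_IWUSR)
    content = read_after_permission_drop(example)
    final_mode = stat.S_IMODE(os.stat(example).st_mode)
```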
Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
Add SIGTSTP signal dump and restore. Add a corresponding field
in the image and save it only if a task is in the stopped state.
Restore the task state by sending the desired stop signal if it is
present in the image. Fall back to SIGSTOP if it is absent.
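
The save/restore rule can be summarised in a few lines. This is a Python
sketch of the decision only; both function names are hypothetical and
None stands in for an absent image field.

```python
import signal


def field_to_save(task_is_stopped, stop_signal):
    """Value stored in the image: the signal number, or None if not stopped."""
    return int(stop_signal) if task_is_stopped else None


def stop_signal_to_send(saved_signal=None):
    """Signal used on restore to put the task back into the stopped state."""
    if saved_signal is None:
        return signal.SIGSTOP  # older images carry no stop signal
    return signal.Signals(saved_signal)
```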
Signed-off-by: Yuriy Vasiliev <yuriy.vasiliev@openvz.org>
Modifications to support criu image streamer when using amdgpu_plugin.
When running with criu image streamer, fseek/lseek is not available, so
we store the file size in the first 8 bytes of the actual file.
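
A size-prefixed stream can be sketched like this. The example is Python,
not the plugin's C code, and the little-endian unsigned 64-bit encoding
of the header is an assumption; only the 8-byte width comes from the
description above.

```python
import os
import struct

SIZE_HEADER = struct.Struct("<Q")  # assumption: little-endian, unsigned, 8 bytes


def write_with_size(stream, payload):
    """Write the payload length first, then the payload itself."""
    stream.write(SIZE_HEADER.pack(len(payload)))
    stream.write(payload)


def read_with_size(stream):
    """Read the length header, then exactly that many payload bytes."""
    header = stream.read(SIZE_HEADER.size)
    if len(header) != SIZE_HEADER.size:
        raise EOFError("truncated size header")
    (size,) = SIZE_HEADER.unpack(header)
    payload = stream.read(size)
    if len(payload) != size:
        raise EOFError("truncated payload")
    return payload


# Usage over a pipe, which cannot seek (like a streamed image).
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    write_with_size(writer, b"example BO contents")
with os.fdopen(read_fd, "rb") as reader:
    restored = read_with_size(reader)
```

The reader never needs to seek to the end to learn the size, which is
the property the streamer case depends on.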
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Store BO contents directly to file (1 per GPU) instead of using
protobuf.
Bug Fix:
Fixes an issue where we could not handle BOs bigger than 4GB because
protobuf has an internal limit of 4GB for the Bytes structure.
Performance Improvements:
This significantly reduces CR duration on multi-GPU systems as it allows
reading and writing to disk in parallel. During restore, instead of
waiting for all the BO contents to be read from the one protobuf file,
we can now start writing the BO contents as soon as the first BO is read
from disk. During checkpoint, we can start writing BO contents to disk
after the first BO is read from VRAM. This also reduces the peak amount of
system memory used as we only need to keep 1 BO content in memory per
GPU at a time instead of all the BO contents.
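
The one-file-per-GPU layout lends itself to one worker per GPU. The
sketch below is Python and purely illustrative: the file naming, the
thread pool and the byte strings standing in for BO contents are all
assumptions, not what the plugin does internally.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def dump_gpu(gpu_id, buffer_objects, out_dir):
    """Write one GPU's BOs to its own file, one BO at a time.

    Passing a generator as buffer_objects keeps only one BO in memory.
    """
    path = os.path.join(out_dir, f"gpu-{gpu_id}.bin")  # hypothetical file name
    with open(path, "wb") as out:
        for bo in buffer_objects:  # each item stands in for one BO's contents
            out.write(bo)
    return path


def dump_all(bos_by_gpu, out_dir):
    """Run dump_gpu for every GPU in parallel and return the file paths."""
    with ThreadPoolExecutor(max_workers=max(1, len(bos_by_gpu))) as pool:
        futures = {
            gpu: pool.submit(dump_gpu, gpu, bos, out_dir)
            for gpu, bos in bos_by_gpu.items()
        }
        return {gpu: future.result() for gpu, future in futures.items()}


with tempfile.TemporaryDirectory() as out_dir:
    paths = dump_all({0: [b"aa", b"bb"], 1: [b"cc"]}, out_dir)
    sizes = {gpu: os.path.getsize(path) for gpu, path in paths.items()}
```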
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add optional parameters to override default behavior during restore.
These parameters are passed in as environment variables before executing
CRIU.
List of parameters:
KFD_FW_VER_CHECK - disable firmware version check
KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable caches count check
KFD_NUM_GWS_CHECK - disable num_gws check
KFD_VRAM_SIZE_CHECK - disable VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
KFD_CAPABILITY_CHECK - disable capability check
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The device topology on the restore node can be different from the
topology on the checkpointed node. The GPUs on the restore node may
have different gpu_ids or minor numbers, or some GPUs may have different
properties than on the checkpointed node. During restore, the CRIU plugin
determines the target GPUs to avoid restore failures caused by trying
to restore a process on a GPU that is different.
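
The idea of picking target GPUs can be illustrated with a simplified
matcher. This Python sketch compares only two made-up properties
("family" and "vram_size") and ignores everything else the plugin may
take into account, so it should be read as an assumption-laden example
rather than the plugin's algorithm.

```python
def match_gpus(checkpointed, available, keys=("family", "vram_size")):
    """Map each checkpointed gpu_id to a distinct, compatible local gpu_id.

    Both arguments are dicts of gpu_id -> property dict. Two GPUs count
    as compatible when all properties named in keys are equal.
    """
    mapping = {}
    free = dict(available)
    for src_id, src_props in checkpointed.items():
        for dst_id, dst_props in free.items():
            if all(src_props.get(k) == dst_props.get(k) for k in keys):
                mapping[src_id] = dst_id
                del free[dst_id]  # each target GPU is used at most once
                break
        else:
            raise LookupError(f"no compatible GPU for checkpointed gpu_id {src_id}")
    return mapping


checkpointed = {
    0x1A2B: {"family": "gfx9", "vram_size": 16},
    0x3C4D: {"family": "gfx9", "vram_size": 32},
}
available = {
    7: {"family": "gfx9", "vram_size": 32},
    9: {"family": "gfx9", "vram_size": 16},
}
gpu_map = match_gpus(checkpointed, available)
```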
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
To support Checkpoint Restore with AMDGPUs for ROCm workloads, introduce
a new plugin to assist CRIU with the help of AMD KFD kernel driver. This
initial commit just provides the basic framework to build up further
capabilities. Like CRIU, the amdgpu plugin also uses protobuf to
serialize and save the amdkfd data, which is mostly VRAM contents with
some metadata.
We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore
this file is read and extracted to re-create various types of buffer
objects that belonged to the previously checkpointed process. Upon
restore the mmap page offset within a device file might change, so we use
the new hook to update and adjust the mmap offsets for the newly created
target process. This is needed for the sys_mmap call in the pie restorer phase.
Support for queues and events is added in future patches of this series.
With the current implementation (amdgpu_plugin), we support:
- Only compute workloads (non-Gfx) are supported
- GPU visible inside a container
- AMD GPU Gfx 9 Family
- Pytorch Benchmarks such as BERT Base
The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are typically
installed with the libdrm-dev package. We build amdgpu_plugin only when the
dependencies are met on the target system and when the user intends to
install the amdgpu plugin, not by default with the criu build.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
The --timeout option was introduced in [1] to prevent criu dump from
being able to hang indefinitely and allow users to adjust the time limit
in seconds for collecting tasks during the dump operation.
[1] https://github.com/checkpoint-restore/criu/commit/d0ff730
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
The expected behavior of the --tcp-close option when dumping is to close
all established TCP connections, including connections that were once
established but are now closed. This adds an explicit description of
that behavior.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Support for external net namespaces has been introduced with
commit c2b21fbf (criu: add support for external net namespaces).
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
Python 2 has been deprecated since January 1, 2020 and Linux distributions
already support Python 3. Thus, to simplify maintenance and packaging
we could support criu-ns as Python 3 only.
v2: Add a message for criu-ns installation
Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
This adds the option to choose the network locking method.
CRIU currently uses iptables-restore cli for network locking/unlocking
but nftables support will be added later.
There have been reports from users that iptables-restore fails in some
way and an nftables based approach using libnftables could avoid this
external dependency.
v2: remove dependency details in man page for --network-lock.
v3: remove --network-lock from restore section in docs because it is
automatically detected from the inventory image now.
v4: add message that --network-lock will be ignored during restore
and value from dump will be used.
v5: run make indent
Signed-off-by: Zeyad Yasser <zeyady98@gmail.com>
This change is motivated by checkpointing and restoring container in
Pods.
When restoring a container into a new Pod the SELinux label of the
existing Pod needs to be used and not the SELinux label saved during
checkpointing.
The option --lsm-profile already enables changing of process SELinux
labels on restore. If, however, tmpfs mounts are checkpointed, they
will be mounted during restore with the same context as during
checkpointing. This can look like the following example:
context="system_u:object_r:container_file_t:s0:c82,c137"
On restore we want to change this context to match the mount label of
the Pod this container is restored into. Changing of the mount label
is now possible with the new option --mount-context:
criu restore --mount-context "system_u:object_r:container_file_t:s0:c204,c495"
This will lead to mount options being changed to
context="system_u:object_r:container_file_t:s0:c204,c495"
Now the restored container can access all the files in the container
again.
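
The effect on the mount options can be sketched as a string rewrite.
This is a Python illustration only; replace_mount_context() is a
hypothetical helper and the surrounding options (rw, nosuid, size) are
invented for the example. A regular expression is used because the
context value itself contains a comma, so splitting the options on
commas would cut it in half.

```python
import re

# Match context="..." but not fscontext=, defcontext= or rootcontext=.
_CONTEXT_RE = re.compile(r'(?<![a-z])context="[^"]*"')


def replace_mount_context(options, new_context):
    """Swap the context="..." mount option, leaving other options untouched."""
    return _CONTEXT_RE.sub(lambda _match: f'context="{new_context}"', options)


old_options = 'rw,nosuid,context="system_u:object_r:container_file_t:s0:c82,c137",size=65536k'
new_options = replace_mount_context(
    old_options, "system_u:object_r:container_file_t:s0:c204,c495"
)
```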
This has been tested in combination with runc and CRI-O.
Signed-off-by: Adrian Reber <areber@redhat.com>
file_validation_method field added to cr_options structure in
"criu/include/cr_options.h" along with the constants:
FILE_VALIDATION_FILE_SIZE
FILE_VALIDATION_BUILD_ID
FILE_VALIDATION_DEFAULT (Equal to FILE_VALIDATION_BUILD_ID)
Usage and description information is yet to be added
Usage:
--file-validation="filesize" (To use only the file size check)
--file-validation="buildid" (To try and use only the build-id check)
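
The mapping from the option value to the method can be pictured as
below. This is a Python sketch, not the C option parser; the enum and
function names are hypothetical, and only the two option strings and the
build-id default come from the description above.

```python
from enum import Enum


class FileValidation(Enum):
    FILE_SIZE = "filesize"
    BUILD_ID = "buildid"


# Mirrors "FILE_VALIDATION_DEFAULT (Equal to FILE_VALIDATION_BUILD_ID)".
DEFAULT_VALIDATION = FileValidation.BUILD_ID


def parse_file_validation(value=None):
    """Translate the --file-validation argument into a validation method."""
    if value is None:
        return DEFAULT_VALIDATION
    try:
        return FileValidation(value)
    except ValueError:
        raise ValueError(f"unknown file validation method: {value!r}") from None
```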
Signed-off-by: Ajay Bharadwaj <ajayrbharadwaj@gmail.com>
This adds the ability to stream images with criu-image-streamer
The workflow is the following:
1) criu-image-streamer is started, and starts listening on a UNIX
socket.
2) CRIU is started. img_streamer_init() is invoked, which connects to the
socket. During dump/restore operations, instead of using local disk to
open an image file, img_streamer_open() is called to provide a UNIX pipe
that is sent over the UNIX socket.
3) Once the operation is done, img_streamer_finish() is called, and the
UNIX socket is disconnected.
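
Step 2 relies on passing a pipe over a UNIX socket. The Python sketch
below (Python 3.9 or newer) shows that mechanism for the dump direction.
It is not the real protocol: the function names are hypothetical,
sending the image name as plain text is an assumption, and a socketpair
stands in for the streamer's listening socket.

```python
import os
import socket


def open_image_for_dump(control_sock, image_name):
    """CRIU side: create a pipe, hand its read end over, keep the write end."""
    read_end, write_end = os.pipe()
    try:
        socket.send_fds(control_sock, [image_name.encode()], [read_end])
    finally:
        os.close(read_end)  # the receiver gets its own copy of the descriptor
    return write_end


def accept_image(control_sock):
    """Streamer side: receive the image name and the pipe's read end."""
    data, fds, _flags, _addr = socket.recv_fds(control_sock, 4096, 1)
    return data.decode(), fds[0]


criu_side, streamer_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
with criu_side, streamer_side:
    write_fd = open_image_for_dump(criu_side, "pages-1.img")  # example name
    name, read_fd = accept_image(streamer_side)
    with os.fdopen(write_fd, "wb") as image:
        image.write(b"image bytes")
    with os.fdopen(read_fd, "rb") as incoming:
        received = incoming.read()
```

For restore the roles of the two pipe ends would be reversed, with the
streamer writing and CRIU reading.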
criu-image-streamer can be found at:
https://github.com/checkpoint-restore/criu-image-streamer
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
The file only includes other headers (which may not be needed).
If we aim for one-include-for-compel, we could instead paste all
subheaders into "compel.h".
Rather, I think it's worth migrating to more fine-grained compel
headers than following the strategy 'one header to rule them all'.
Further, the header creates problems for cross-compilation: it's
included in files that are used by host-compel. This rightfully
confuses the compiler/linker, as host's definitions for fpu regs/other
platform details leak into host's compel.
Signed-off-by: Dmitry Safonov <dima@arista.com>
This option was introduced with:
e2c38245c6
v2: (comment from Pavel Tikhomirov) --enable-fs does not fit with
--external dev[]:, see try_resolve_ext_mount, external dev mounts
only determined for FSTYPE__UNSUPPORTED.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Commit 0493724c8eda3 added support for using asciidoctor
(instead of asciidoc + xmlto) to generate man pages.
For some reason, asciidoctor does not deal well with some
complex formatting that we use for options such as --external,
leading to literal ’ and ' appearing in the man page instead
of italic formatting. For example:
> --inherit-fd fd[’N']:’resource'
(here both N and resource should be in italic).
Asciidoctor documentation (asciidoctor --help syntax) says:
> == Text Formatting
>
> .Constrained (applied at word boundaries)
> *strong importance* (aka bold)
> _stress emphasis_ (aka italic)
> `monospaced` (aka typewriter text)
> "`double`" and '`single`' typographic quotes
> +passthrough text+ (substitutions disabled)
> `+literal text+` (monospaced with substitutions disabled)
>
> .Unconstrained (applied anywhere)
> **C**reate+**R**ead+**U**pdate+**D**elete
> fan__freakin__tastic
> ``mono``culture
so I had to carefully replace *bold* with **bold** and
'italic' with __italic__ to make it all work.
Tested with both terminal and postscript output, with both
asciidoctor and asciidoc+xmlto.
TODO: figure out how to fix examples (literal multi-line text),
since asciidoctor does not display it in monospaced font (this
is only true for postscript/pdf output so low priority).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Add a/the articles where I see them missing
2. s/Forbid/disable/
3. s/crit/crit(1)/ as we're referring to a man page
4. Simplify some descriptions
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case asciidoc is installed and xmlto is not, make returns an error
but there's no diagnostics shown, since "xmlto: command not found"
goes to /dev/null.
Remove the redirect.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The original/old guide probably doesn't work anymore:
- the patch isn't accessible;
- criu now depends on more libraries, not only protobuf
Still, keep it as it might be helpful for someone.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Two modes of pre-dump algorithm:
1) splicing memory by parasite
--pre-dump-mode=splice (default)
2) using process_vm_readv syscall
--pre-dump-mode=read
Signed-off-by: Abhishek Dubey <dubeyabhishek777@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Instead of creating cgroup yard in CRIU, now we can create it externally
and pass it to CRIU. Useful if somebody doesn't want to grant
CAP_SYS_ADMIN to CRIU.
Signed-off-by: Michał Cłapiński <mclapinski@google.com>
This commit adds Transport Layer Security (TLS) support for remote
page-server connections.
The following command-line options are introduced with this commit:
--tls-cacert FILE Trust certificates signed only by this CA
--tls-cacrl FILE CA certificate revocation list
--tls-cert FILE TLS certificate
--tls-key FILE TLS private key
--tls Use TLS to secure remote connections
The default PKI locations are:
CA certificate /etc/pki/CA/cacert.pem
CA revocation list /etc/pki/CA/cacrl.pem
Client/server certificate /etc/pki/criu/cert.pem
Client/server private key /etc/pki/criu/private/key.pem
The files cacert.pem and cacrl.pem are optional. If they are not
present, and not explicitly specified with a command-line option,
CRIU will use only the system's trusted CAs to verify the remote
peer's identity. This implies that if a CA certificate is specified
using "--tls-cacert" only this CA will be used for verification.
If CA certificate (cacert.pem) is not present, certificate revocation
list (cacrl.pem) will be ignored.
Both (client and server) sides require a private key and certificate.
When the "--tls" option is specified, a TLS handshake (key exchange)
will be performed immediately after the remote TCP connection has been
accepted.
X.509 certificates can be generated as follows:
-------------------------%<-------------------------
# Generate CA key and certificate
echo -ne "ca\ncert_signing_key" > temp
certtool --generate-privkey > cakey.pem
certtool --generate-self-signed \
--template temp \
--load-privkey cakey.pem \
--outfile cacert.pem
# Generate server key and certificate
echo -ne "cn=$HOSTNAME\nencryption_key\nsigning_key" > temp
certtool --generate-privkey > key.pem
certtool --generate-certificate \
--template temp \
--load-privkey key.pem \
--load-ca-certificate cacert.pem \
--load-ca-privkey cakey.pem \
--outfile cert.pem
rm temp
mkdir -p /etc/pki/CA
mkdir -p /etc/pki/criu/private
mv cacert.pem /etc/pki/CA/
mv cert.pem /etc/pki/criu/
mv key.pem /etc/pki/criu/private
-------------------------%<-------------------------
Usage Example:
Page-server:
[src]# criu page-server -D <PATH> --port <PORT> --tls
[dst]# criu dump --page-server --address <SRC> --port <PORT> \
-t <PID> -D <PATH> --tls
Lazy migration:
[src]# criu dump --lazy-pages --port <PORT> -t <PID> -D <PATH> --tls
[dst]# criu lazy-pages --page-server --address <SRC> --port <PORT> \
-D <PATH> --tls
[dst]# criu restore -D <PATH> --lazy-pages
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Since commit 6c572bee8f10 ("cgroup: Set "soft" mode by default") it
became impossible to set ignore mode at all. Provide a user option to do
that.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
* "post-resume" was introduced with commit:
2ab599398ddbde3449f1b9d4d3f5152591854cff
cr-restore: "post-resume" hook introduced
This hook is called at the very end, when everything is restored and processes
were resumed.
Can be used for some actions which require an operational container, like
restarting systemd autofs services.
* "post-setup-namespaces" was introduced with commit:
eec66f3d30f9ccd75a8f1fab6920c20933eecd64
criu [PATCH] post-setup-namespaces
Introduce post-setup-namespaces action script
It is needed to have the possibility to run a custom script after the
mount namespace is configured
* "orphan-pts-master" was introduced with commit:
6afe523d97d59e6bf29621b8aa0e6a4332f710fc
tty: notify about orphan tty-s via rpc
Now Docker creates a pty pair from a container devpts to use it as a console.
A slave tty is set as a control tty for the init process and bind-mounted
into /dev/console. The master tty is handled externally.
Now CRIU can handle external resources, but here we have internal resources
which are used externally.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Since asciidoc is based on Python 2, we want to move to an alternative,
and a promising one is asciidoctor. This patch allows using
asciidoctor for formatting man pages instead of asciidoc, by passing
a make option, USE_ASCIIDOCTOR=yes.
Although asciidoctor is almost compatible with asciidoc, it can
produce a man page directly from a text file without XML, which is
more efficient. So in asciidoctor mode, we don't require xmlto.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
a2x is never used although its presence is checked mandatorily.
Let's remove this superfluous check and the unused entry.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The option --lsm-profile was added with commit:
6af96c8404181e63d2424d1695fd7f8a42a291bf
lsm: add a --lsm-profile flag
In LXD, we use the container name in the LSM profile. If the container name
is changed on migrate (on the host side), we want to use a different LSM
profile name (à la --cgroup-root). This flag adds that support.
A usage example is available in
13389b2963
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
The --tcp-close option was introduced with commit
2c37042821906f013634e305899cb25ff1a5a7b1
tcp: Add tcp-close option to restore connected TCP sockets in closed state
This option is applicable only for restore. Therefore, move the
documentation from 'dump' to 'restore'.
Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>