Add optional parameters to override default behavior during restore.
These parameters are passed in as environment variables before executing
CRIU.
List of parameters:
KFD_FW_VER_CHECK - disable firmware version check
KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable caches count check
KFD_NUM_GWS_CHECK - disable num_gws check
KFD_VRAM_SIZE_CHECK - disable VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
KFD_CAPABILITY_CHECK - disable capability check
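
As a rough illustration of how such an override could be read from the
environment, here is a minimal sketch. The helper names and the rule that
only the value "0" disables a check are assumptions made for this example;
the values the plugin actually accepts are not stated in this message.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical rule for this sketch: a check stays enabled unless the
 * variable is set to "0". An unset variable keeps the default behavior.
 */
static bool value_enables_check(const char *val)
{
	return val == NULL || strcmp(val, "0") != 0;
}

/* Hypothetical helper: look up an override such as KFD_FW_VER_CHECK. */
static bool check_enabled(const char *env_name)
{
	return value_enables_check(getenv(env_name));
}
```

With this rule, a caller would run a check only when, for example,
check_enabled("KFD_FW_VER_CHECK") returns true.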
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
The device topology on the restore node can be different from the
topology on the checkpointed node. The GPUs on the restore node may
have different gpu_ids or minor numbers, or some GPUs may have different
properties than on the checkpointed node. During restore, the CRIU plugin
determines the target GPUs to avoid restore failures caused by trying
to restore a process on a GPU that is different.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Parse local system topology in /sys/class/kfd/kfd/topology/nodes/ and
store properties for each gpu in the CRIU image files. The gpu
properties can then be used later during restore to make sure the process is
restored on GPUs with similar properties.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
To support Checkpoint Restore with AMDGPUs for ROCm workloads, introduce
a new plugin to assist CRIU with the help of AMD KFD kernel driver. This
initial commit just provides the basic framework to build up further
capabilities. Like CRIU, the amdgpu plugin also uses protobuf to
serialize and save the amdkfd data which is mostly VRAM contents with
some metadata.
We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore
this file is read and extracted to re-create various types of buffer
objects that belonged to the previously checkpointed process. Upon
restore the mmap page offset within a device file might change so we use
the new hook to update and adjust the mmap offsets for the newly created
target process. This is needed for the sys_mmap call in the pie restorer phase.
Support for queues and events is added in future patches of this series.
With the current implementation (amdgpu_plugin), we support:
- Only compute workloads (non-Gfx)
- GPU visible inside a container
- AMD GPU Gfx 9 Family
- Pytorch Benchmarks such as BERT Base
The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are typically
installed with the libdrm-dev package. We build amdgpu_plugin only when the
dependencies are met on the target system and when the user intends to
install the amdgpu plugin, not by default with the criu build.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
kfd_ioctl.h contains the definitions for the APIs and required arguments
to call the ioctls so simply copy the header as is for amdgpu plugin.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
During premap phase, skip vmas that are handled by external plugins as
their offsets may change when the plugin restores them. This change is
needed when running with criu image streamer.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add a dedicated flag for vmas that are handled by an external plugin,
as the previously used VMA_UNSUPP flag depends on the vma not having
the VMA_FILE_SHARED flag.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Add a new global function to return unused FD based on the pid. This
function can be used in situations where we need a FD that will not
conflict with FDs used by the target restore process, but
struct pstree_item is not available (e.g. in plugins).
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Some device drivers (e.g. DRM) only allow the file descriptor that was
used to create the vma to be used when calling mmap.
In this case, instead of opening a new FD, the plugin will return a
valid FD that can be used for mmap later. The plugin needs to close the
returned FD later. Copies of the returned FD that are created using dup
or fcntl(..,F_DUPFD,..) are references to the same struct file inside
the kernel, so they are also allowed to mmap.
The plugin does not need to update the path anymore as the plugin can
return a FD for the correct path.
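
The point about duplicated descriptors can be observed from user space:
a duplicate shares the open file description (and therefore the file
offset) with the original. The sketch below uses a temporary regular file
rather than a DRM device, and the function name is made up for this
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if a duplicate created with F_DUPFD sees an offset change made
 * through the original descriptor, 0 if it does not, -1 on setup failure.
 */
static int dup_shares_open_file(void)
{
	char path[] = "/tmp/dup-sketch-XXXXXX";
	int fd, copy, shared = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path); /* the file lives on until both descriptors are closed */

	copy = fcntl(fd, F_DUPFD, 0); /* lowest free descriptor, like dup(fd) */
	if (copy >= 0) {
		/* Writing through fd advances the offset seen through copy. */
		if (write(fd, "abcd", 4) == 4)
			shared = lseek(copy, 0, SEEK_CUR) == 4;
		close(copy);
	}
	close(fd);
	return shared;
}
```

A descriptor obtained from a separate open() of the same path would have
its own struct file and its own offset, which is the case the DRM
restriction above rules out.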
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
coverity CID 389187:
3193int veth_pair_add(char *in, char *out)
3194{
3195 char *e_str;
3196
1. alloc_fn: Storage is returned from allocation function malloc.
2. var_assign: Assigning: ___p = storage returned from malloc(200UL).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: e_str = ({...; ___p;}).
3197 e_str = xmalloc(200); /* For 3 IFNAMSIZ + 8 service characters */
6. Condition !e_str, taking false branch.
3198 if (!e_str)
3199 return -1;
7. noescape: Resource e_str is not freed or pointed-to in snprintf.
3200 snprintf(e_str, 200, "veth[%s]:%s", in, out);
8. noescape: Resource e_str is not freed or pointed-to in add_external. [show details]
CID 389187 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
3201 return add_external(e_str);
3202}
We should free the e_str string after we finish using it in veth_pair_add;
the easiest way to do it is to use the cleanup_free attribute.
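
For readers unfamiliar with the pattern, a cleanup attribute of this kind
can be sketched with the GCC/Clang cleanup extension as follows. The macro
and function names here are hypothetical and chosen for the example; they
are not CRIU's definitions, and the counter exists only to make the effect
observable.

```c
#include <stdio.h>
#include <stdlib.h>

static int nr_freed; /* instrumentation for this sketch only */

/* The cleanup handler receives a pointer to the annotated variable. */
static void free_ptr_sketch(void *p)
{
	void **pp = p;

	free(*pp);
	nr_freed++;
}

#define cleanup_free_sketch __attribute__((cleanup(free_ptr_sketch)))

/* Builds "veth[in]:out" in a temporary buffer and copies it to dst. */
static int format_pair(const char *in, const char *out, char *dst, size_t len)
{
	cleanup_free_sketch char *tmp = malloc(200);

	if (!tmp)
		return -1;
	snprintf(tmp, 200, "veth[%s]:%s", in, out);
	snprintf(dst, len, "%s", tmp);
	return 0; /* tmp is freed automatically on every return path */
}
```

The benefit is that early returns added later cannot reintroduce the
leak, since the handler runs whenever the variable goes out of scope.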
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389192:
550static int parse_join_ns(const char *ptr)
551{
...
553 char *ns;
554
1. alloc_fn: Storage is returned from allocation function strdup.
2. var_assign: Assigning: ___p = storage returned from strdup(ptr).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: ns = ({...; ___p;}).
555 ns = xstrdup(ptr);
6. Condition ns == NULL, taking false branch.
556 if (ns == NULL)
557 return -1;
558
7. noescape: Resource ns is not freed or pointed-to in strchr.
559 aux = strchr(ns, ':');
8. Condition aux == NULL, taking true branch.
560 if (aux == NULL)
CID 389192 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable ns going out of scope leaks the storage it points to.
561 return -1;
We should free the ns string after we finish using it in parse_join_ns;
the easiest way to do it is to use the cleanup_free attribute.
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
The config_inotify_irmap test duplicates inotify_irmap with a slight
change to add the --force-irmap and --irmap-scan-path options in
a configuration file.
The --criu-config option of ZDTM provides a more general solution
for testing CRIU options provided in configuration files.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
The --criu-config option allows running tests with CRIU options provided
via configuration files instead of command-line arguments.
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Using long-form command-line options allows us to provide
them to CRIU via a config file.
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
This patch improves the readability of zdtm by refactoring the top-level
code into a main function.
https://docs.python.org/3/library/__main__.html
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
coverity CID 389193:
CID 389193 (#1 of 1): Printf format string issue (PW.BAD_PRINTF_FORMAT_STRING)
1. bad_printf_format_string: invalid format string conversion
598 pr_warn("Can't stat socket %#x(%s), skipping: %m (err %d)\n", id, rpath, errno);
Specifier "%#x" is wrong for id as it is of type uint32_t, let's change
it to "%#" PRIx32 "" to fix the problem.
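
A small standalone sketch of the corrected specifier, using snprintf
instead of pr_warn so that it can be compiled on its own. The function
name is invented for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Formats a uint32_t id with the matching <inttypes.h> conversion macro. */
static int format_sock_id(char *buf, size_t len, uint32_t id)
{
	/* "%#" PRIx32 concatenates into one format string at compile time. */
	return snprintf(buf, len, "socket %#" PRIx32, id);
}
```

PRIx32 expands to the conversion that matches uint32_t on the target, so
the format string stays correct regardless of how the type is defined.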
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389205:
452int dump_tun_link(NetDeviceEntry *nde, struct cr_imgset *fds, struct nlattr **info)
453{
...
458 struct tun_link *tl;
...
2. alloc_fn: Storage is returned from allocation function get_tun_link_fd. [show details]
3. var_assign: Assigning: tl = storage returned from get_tun_link_fd(nde->name, nde->peer_nsid, tle.flags).
475 tl = get_tun_link_fd(nde->name, nde->peer_nsid, tle.flags);
4. Condition !tl, taking false branch.
476 if (!tl)
477 return ret;
478
479 tle.vnethdr = tl->dmp.vnethdr;
480 tle.sndbuf = tl->dmp.sndbuf;
481
482 nde->tun = &tle;
CID 389205 (#1 of 1): Resource leak (RESOURCE_LEAK)5. leaked_storage: Variable tl going out of scope leaks the storage it points to.
483 return write_netdev_img(nde, fds, info);
484}
Function get_tun_link_fd() can return either a tun_link entry from the
tun_links list or a newly allocated one. To fix the leak, we should free
the entry when it is a new one and must not free it when it is from the list.
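
One possible way to express this ownership rule is sketched below. The
structure, the on_list flag and the function names are all hypothetical
and exist only for illustration; how the actual fix tells the two cases
apart is not described in this message.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct link_sketch {
	char name[16];
	bool on_list; /* true if the global list owns this entry */
	struct link_sketch *next;
};

static struct link_sketch *link_list;

/* Returns a borrowed entry from the list, or a new caller-owned one. */
static struct link_sketch *get_link(const char *name)
{
	struct link_sketch *l;

	for (l = link_list; l; l = l->next)
		if (!strcmp(l->name, name))
			return l; /* borrowed: the list keeps ownership */

	l = calloc(1, sizeof(*l));
	if (l)
		snprintf(l->name, sizeof(l->name), "%s", name);
	return l; /* on_list is false: the caller must release it */
}

/* Releases only entries that were allocated for the caller. */
static void put_link(struct link_sketch *l)
{
	if (l && !l->on_list)
		free(l);
}
```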
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
coverity CID 389202:
54int ext_mount_add(char *key, char *val)
55{
56 char *e_str;
57
1. alloc_fn: Storage is returned from allocation function malloc.
2. var_assign: Assigning: ___p = storage returned from malloc(strlen(key) + strlen(val) + 8UL).
3. Condition !___p, taking false branch.
4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
5. var_assign: Assigning: e_str = ({...; ___p;}).
58 e_str = xmalloc(strlen(key) + strlen(val) + 8);
6. Condition !e_str, taking false branch.
59 if (!e_str)
60 return -1;
...
7. noescape: Resource e_str is not freed or pointed-to in sprintf.
73 sprintf(e_str, "mnt[%s]:%s", key, val);
8. noescape: Resource e_str is not freed or pointed-to in add_external. [show details]
CID 389202 (#1 of 1): Resource leak (RESOURCE_LEAK)9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
74 return add_external(e_str);
75}
We need to free e_str after add_external used it.
v2: use cleanup_free attribute (@adrianreber)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
During error injection tests there are random values loaded in some of
the registers. The kernel, however, has the following check:
if (mxcsr[0] & ~mxcsr_feature_mask)
return -EINVAL;
So depending on the random values loaded mxcsr might have values that
the kernel rejects with EINVAL. Setting mxcsr to zero during the tests
lets the error injection test pass.
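
The effect of the kernel check quoted above can be modelled in a few
lines. The function name is made up and the mask value used in the note
below is illustrative only.

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the quoted check: reject any bit outside the feature mask. */
static int check_mxcsr(uint32_t mxcsr, uint32_t feature_mask)
{
	if (mxcsr & ~feature_mask)
		return -EINVAL;
	return 0;
}
```

A zero mxcsr passes for any mask, which is why clearing the register
makes the test independent of whatever random value was loaded; for
example, check_mxcsr(0, 0xffbf) returns 0 while
check_mxcsr(0xffff0000, 0xffbf) returns -EINVAL.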
Signed-off-by: Adrian Reber <areber@redhat.com>
There is no 'err' argument for print(); it should be in grep_errors() on
the line below.
Fixes: bed670f62 ("zdtm: print tails of all logs if a test has failed")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Linux kernel release 5.16 removed support for LOCK_MAND flock, so the
test that verifies whether LOCK_MAND works started to fail with 5.16.
The kernel also logs the following message:
Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
This fixes CRIU CI using Fedora with 5.16.
See Linux Kernel commit 90f7d7a0d0d68623b5f7df5621a8d54d9518fcc4
"locks: remove LOCK_MAND flock lock support"
Signed-off-by: Adrian Reber <areber@redhat.com>
Starting with Linux Kernel release 5.16 the fdinfo proc entry contains
a map_extra field which breaks CRIU parsing of bpfmap entries.
This commit adds the map_extra as a possible field to CRIU. The value of
map_extra is not passed to the kernel on restore as it does not seem to
be evaluated in the code paths CRIU restore is using for BPF.
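
A simplified sketch of accepting the extra field while parsing
fdinfo-style text. The structure, the function name and the sample layout
("name:<tab>value" per line) are assumptions for illustration; this is
not CRIU's parser.

```c
#include <stdio.h>
#include <string.h>

struct bpfmap_info_sketch {
	unsigned int map_type;
	unsigned long long map_extra;
	int has_map_extra; /* stays 0 on kernels that do not print the field */
};

/* Parses "name:<tab>value" lines; fields not handled here are skipped. */
static int parse_bpfmap_fdinfo(const char *text, struct bpfmap_info_sketch *info)
{
	char line[128];
	const char *p = text;

	memset(info, 0, sizeof(*info));
	while (*p) {
		size_t n = strcspn(p, "\n");

		if (n >= sizeof(line))
			return -1;
		memcpy(line, p, n);
		line[n] = '\0';
		p += n;
		if (*p == '\n')
			p++;

		if (sscanf(line, "map_type: %u", &info->map_type) == 1)
			continue;
		/* %llx also accepts a leading "0x" prefix. */
		if (sscanf(line, "map_extra: %llx", &info->map_extra) == 1)
			info->has_map_extra = 1;
	}
	return 0;
}
```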
This fixes CRIU CI using Fedora with 5.16.
See Linux commit 9330986c03006ab1d33d243b7cfe598a7a3c1baa
"bpf: Add bloom filter map implementation"
Signed-off-by: Adrian Reber <areber@redhat.com>
Currently, hugetlb mappings are not premapped, so in the restore content phase we
skip reading these pages, enqueue the iovecs for later reading in the restorer, and
eventually close the page read. However, image-streamer expects the whole image
to be read, and the image is neither re-opened nor sent twice. As a result, these
MAP_HUGETLB test cases fail with an EPIPE error. Disable these test cases for now.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
This commit adds a test for checkpoint/restore of MAP_HUGETLB memory mappings.
A new zdtm helper get_mapping_dev() is added to get the device number of
the memory mapping.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
As hugetlb mappings are not premapped, they are not registered with the uffd
service in the restorer code. We must not mark these mappings as PPB_LAZY in
generate_iovs(); otherwise, when restoring the content of these mappings, we
will keep looking for them in uffd and get ENOENT because they are not registered.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
As we cannot use mremap() to move hugetlb mappings around until Linux kernel
version 5.16, we need to skip premapping hugetlb mappings.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
When memfd can be used with hugetlb, we use memfd to checkpoint/restore
anonymous shared memory. Otherwise, map_files symlinks are used to
checkpoint/restore anonymous shared memory.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Attach the System V shared memory segments to the address space via shmat() to
determine if they are backed by hugetlb and their page size. Use this
information for setting the correct flags on restore.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
These numbers are used to determine whether a memory mapping is backed by
hugetlb and its page size.
As more hugepages can be allocated after the first time we collect kerndat,
we need to collect the missing device numbers every time we load the kerndat
cache.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
When PTRACE_GET_THREAD_AREA fails on kernels with
!CONFIG_IA32_EMULATION because of missing support (-EIO), compel should
ignore such errors in native mode.
However the check for error type uses return value of ptrace rather than
errno, which will always result in error propagation.
Use errno to detect type of error to fix this.
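
The difference between the two checks can be sketched with a stand-in for
the ptrace() wrapper. Both function names are hypothetical; the stand-in
only mimics the libc convention of returning -1 and setting errno.

```c
#include <errno.h>

/* Stand-in for ptrace(): returns -1 and sets errno, as libc wrappers do. */
static long fake_ptrace_get_thread_area(void)
{
	errno = EIO;
	return -1;
}

/* Returns 0 when the call succeeded or failed only for lack of support. */
static int get_thread_area_sketch(void)
{
	long ret = fake_ptrace_get_thread_area();

	if (ret == 0)
		return 0;
	/* Testing "ret == -EIO" would never match here, since ret is -1. */
	if (errno == EIO)
		return 0; /* missing support: ignore in native mode */
	return -1;
}
```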
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
As we call the mmap syscall directly, the value returned on error is the
negative error number, not -1 as in the libc wrapper. Use IS_ERR to check
for errors correctly.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
os.WEXITSTATUS() returns the process exit status and it should be used
only if WIFEXITED() is true, i.e., the process terminated normally.
os.waitstatus_to_exitcode() does the same as os.WEXITSTATUS() but it
also handles the case when the process has been terminated by a signal.
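
The same distinction exists in the C wait macros. The sketch below is a
C analogue written for illustration, not the zdtm change itself: it
mirrors the behaviour described above by returning the exit status for a
normal exit and the negated signal number for a signal death. Function
names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Exit status if exited normally, -signum if killed, INT_MIN otherwise. */
static int waitstatus_to_exitcode_sketch(int status)
{
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	if (WIFSIGNALED(status))
		return -WTERMSIG(status);
	return INT_MIN;
}

/* Forks a child that either raises sig (if non-zero) or exits with code. */
static int run_child(int code, int sig)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return INT_MIN;
	if (pid == 0) {
		if (sig)
			raise(sig);
		_exit(code);
	}
	if (waitpid(pid, &status, 0) != pid)
		return INT_MIN;
	return waitstatus_to_exitcode_sketch(status);
}
```

Calling WEXITSTATUS() on the status of a child killed by a signal would
silently yield a meaningless value, which is the case the change guards
against.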
Suggested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
If we replace old_sid with current_sid, we should also do the same
replacement for a matching pgid (=old_sid).
Reported in CRIU gitter by Younes Manton (@ymanton)
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>