2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 18:07:57 +00:00

7 Commits

Author SHA1 Message Date
Kir Kolyshkin
0194ed392f Fix some codespell warnings
Brought to you by

	codespell -w

(using codespell v2.1.0).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
2095de9f03 criu/plugin: Fix for FDs not allowed to mmap
On newer kernel's (> 5.13), KFD & DRM drivers will only allow the
/dev/renderD* file descriptors that were used during the CRIU_RESTORE
ioctl when calling mmap for the vma's.
During restore, after opening /dev/renderD*, amdgpu_plugin keeps the
FDs opened and instead returns a copy of the FDs to CRIU. The same FDs
are then returned during the UPDATE_VMAMAP hooks so that they can be
used by CRIU to call mmap. Duplicated FDs created using dup are
references to the same struct file inside the kernel so they are also
allowed to mmap.
To prevent the opened FDs inside amdgpu_plugin from conflicting with
FDs used by the target restore application, we make sure that the
lowest-numbered FD that amdgpu_plugin will use is greater than the
highest-numbered FD that is used by the target application.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
a218fe0baa criu/plugin: Read and write BO contents in parallel
Implement multi-threaded code to read and write contents of each GPU
VRAM BOs in parallel in order to speed up dumping process when using
multiple GPUs.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
ba9c62df24 criu/plugin: Add unit tests for GPU remapping
Adding unit tests for GPU remapping code when checkpointing and
restoring on different nodes with different topologies.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
4856e0d4d0 criu/plugin: Add parameters to override mapping
Add optional parameters to override default behavior during restore.
These parameters are passed in as environment variables before executing
CRIU.

List of parameters:

KFD_FW_VER_CHECK - disable firmware version check
KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable caches count check
KFD_NUM_GWS_CHECK - disable num_gws check
KFD_VRAM_SIZE_CHECK - disable VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
KFD_CAPABILITY_CHECK - disable capability check

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
72905c9c9b criu/plugin: Remap GPUs on checkpoint restore
The device topology on the restore node can be different from the
topology on the checkpointed node. The GPUs on the restore node may
have different gpu_ids, minor number. or some GPUs may have different
properties as checkpointed node. During restore, the CRIU plugin
determines the target GPUs to avoid restore failures caused by trying
to restore a process on a gpu that is different.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
6e99fea2fa criu/plugin: Implement system topology parsing
Parse local system topology in /sys/class/kfd/kfd/topology/nodes/ and
store properties for each gpu in the CRIU image files. The gpu
properties can then be used later during restore to make the process is
restored on gpu's with similar properties.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00