2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 18:07:57 +00:00

7 Commits

Author SHA1 Message Date
David Yat Sin
6d79266229 criu/plugin: Restore libhsakmt shared memory files
Libhsakmt(thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem
and its semaphore in /dev/shm/hsakmt_shared_mem. Adding a check during
checkpoint to see if these two files exist. If they exist then the
plugin will try to restore them during restore.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
a218fe0baa criu/plugin: Read and write BO contents in parallel
Implement multi-threaded code to read and write contents of each GPU
VRAM BOs in parallel in order to speed up dumping process when using
multiple GPUs.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
4856e0d4d0 criu/plugin: Add parameters to override mapping
Add optional parameters to override default behavior during restore.
These parameters are passed in as environment variables before executing
CRIU.

List of parameters:

KFD_FW_VER_CHECK - disable firmware version check
KFD_SDMA_FW_VER_CHECK - disable SDMA firmware version check
KFD_CACHES_COUNT_CHECK - disable caches count check
KFD_NUM_GWS_CHECK - disable num_gws check
KFD_VRAM_SIZE_CHECK - disable VRAM size check
KFD_NUMA_CHECK - preserve NUMA regions
KFD_CAPABILITY_CHECK - disable capability check

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
72905c9c9b criu/plugin: Remap GPUs on checkpoint restore
The device topology on the restore node can be different from the
topology on the checkpointed node. The GPUs on the restore node may
have different gpu_ids, minor number. or some GPUs may have different
properties as checkpointed node. During restore, the CRIU plugin
determines the target GPUs to avoid restore failures caused by trying
to restore a process on a gpu that is different.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
6e99fea2fa criu/plugin: Implement system topology parsing
Parse local system topology in /sys/class/kfd/kfd/topology/nodes/ and
store properties for each gpu in the CRIU image files. The gpu
properties can then be used later during restore to make the process is
restored on gpu's with similar properties.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
c4e3ac7fef criu/plugin: Adding check for kernel IOCTL version
Adding check for minimum kernel IOCTL version before attempting to
checkpoint.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
Rajneesh Bhardwaj
55a5993bc7 criu/plugin: Support AMD ROCm Checkpoint Restore with KFD
To support Checkpoint Restore with AMDGPUs for ROCm workloads, introduce
a new plugin to assist CRIU with the help of AMD KFD kernel driver. This
initial commit just provides the basic framework to build up further
capabilities. Like CRIU, the amdgpu plugin also uses protobuf to
serialize
and save the amdkfd data which is mostly VRAM contents with some
metadata.
We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore
this file is read and extracted to re-create various types of buffer
objects that belonged to the previously checkpointed process. Upon
restore the mmap page offset within a device file might change so we use
the new hook to update and adjust the mmap offsets for newly created
target process. This is needed for sys_mmap call in pie restorer phase.
Support for queues and events is added in future patches of this series.

With the current implementation (amdgpu_plugin), we support:
     - Only compute workloads such (Non Gfx) are supported
     - GPU visible inside a container
     - AMD GPU Gfx 9 Family
     - Pytorch Benchmarks such as BERT Base

amdgpu plugin dependes on libdrm and libdrm_amdgpu which are typically
installed with libdrm-dev package. We build amdgpu_plugin only when the
dependencies are met on the target system and when user intends to
install the amdgpu plugin and not by default with criu build.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Co-authored-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2022-04-28 17:53:52 -07:00