mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 01:51:51 +00:00
criu/Documentation/amdgpu_plugin.txt


criu/plugin: Support AMD ROCm Checkpoint Restore with KFD To support Checkpoint Restore with AMDGPUs for ROCm workloads, introduce a new plugin to assist CRIU with the help of AMD KFD kernel driver. This initial commit just provides the basic framework to build up further capabilities. Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data which is mostly VRAM contents with some metadata. We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore this file is read and extracted to re-create various types of buffer objects that belonged to the previously checkpointed process. Upon restore the mmap page offset within a device file might change so we use the new hook to update and adjust the mmap offsets for newly created target process. This is needed for sys_mmap call in pie restorer phase. Support for queues and events is added in future patches of this series. With the current implementation (amdgpu_plugin), we support: - Only compute workloads such (Non Gfx) are supported - GPU visible inside a container - AMD GPU Gfx 9 Family - Pytorch Benchmarks such as BERT Base amdgpu plugin dependes on libdrm and libdrm_amdgpu which are typically installed with libdrm-dev package. We build amdgpu_plugin only when the dependencies are met on the target system and when user intends to install the amdgpu plugin and not by default with criu build. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Co-authored-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2021-12-15 14:18:02 -05:00
ROCM Support(1)
===============
NAME
----
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.
CURRENT SUPPORT
---------------
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a docker container
PyTorch
TensorFlow
DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations; for example, it cannot handle
applications that have device files open. To support *ROCm* based
workloads with *criu*, we need to augment criu's core functionality with a
plugin based extension mechanism. *amdgpu_plugin* provides the support
criu needs to allow Checkpoint / Restore with ROCm.
Dependencies
~~~~~~~~~~~~~~
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of the amdkfd (amdgpu) driver. The kernel patches are
currently under review.
*criu 3.16*::
This work is rebased on the latest criu release available at the time of
writing.
OPTIONS
-------
Optional parameters can be passed in as environment variables before
executing the criu command.
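
As a minimal sketch of passing these variables, the helper below builds the
environment and command line for a restore. The helper names `build_restore`
and `run_restore` and the image directory path are hypothetical and chosen
only for illustration; the sketch assumes a plain `criu restore -D <dir>`
invocation and that the plugin is already installed where criu can find it.

```python
import os
import subprocess


def build_restore(images_dir, overrides):
    """Return (argv, env) for a `criu restore` run with plugin options set.

    `overrides` maps option names from this page to "0" or "1".
    """
    env = dict(os.environ)   # inherit the caller's environment
    env.update(overrides)    # layer the plugin options on top
    argv = ["criu", "restore", "-D", images_dir]
    return argv, env


def run_restore(images_dir, overrides):
    """Run the restore; criu usually needs root privileges for this."""
    argv, env = build_restore(images_dir, overrides)
    return subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # Hypothetical image directory; only the firmware version check is disabled.
    argv, env = build_restore("/tmp/criu-images", {"KFD_FW_VER_CHECK": "0"})
    print("KFD_FW_VER_CHECK=" + env["KFD_FW_VER_CHECK"], " ".join(argv))
```

Setting the variables only in the child's environment, rather than exporting
them in the calling shell, keeps the overrides scoped to a single criu run.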
*KFD_FW_VER_CHECK*::
Enable or disable the firmware version check.
If enabled, the firmware version on the restored GPU needs to be greater than
or equal to the firmware version on the checkpointed GPU. Default: Enabled
E.g.:
KFD_FW_VER_CHECK=0
*KFD_SDMA_FW_VER_CHECK*::
Enable or disable the SDMA firmware version check.
If enabled, the SDMA firmware version on the restored GPU needs to be greater
than or equal to the SDMA firmware version on the checkpointed GPU.
Default: Enabled
E.g.:
KFD_SDMA_FW_VER_CHECK=0
*KFD_CACHES_COUNT_CHECK*::
Enable or disable the caches count check. If enabled, the caches count on
the restored GPU needs to be greater than or equal to the caches count on
the checkpointed GPU. Default: Enabled
E.g.:
KFD_CACHES_COUNT_CHECK=0
*KFD_NUM_GWS_CHECK*::
Enable or disable the num_gws check. If enabled, the num_gws on the
restored GPU needs to be greater than or equal to the num_gws on the
checkpointed GPU. Default: Enabled
E.g.:
KFD_NUM_GWS_CHECK=0
*KFD_VRAM_SIZE_CHECK*::
Enable or disable the VRAM size check. If enabled, the VRAM size on the
restored GPU needs to be greater than or equal to the VRAM size on the
checkpointed GPU. Default: Enabled
E.g.:
KFD_VRAM_SIZE_CHECK=0
*KFD_NUMA_CHECK*::
Enable or disable the NUMA CPU region check. If enabled, the plugin will
restore GPUs that belong to one CPU NUMA region to the same CPU NUMA region.
Default: Enabled
E.g.:
KFD_NUMA_CHECK=1
*KFD_CAPABILITY_CHECK*::
Enable or disable the capability check. If enabled, the capability on the
restored GPU needs to be equal to the capability on the checkpointed GPU.
Default: Enabled
E.g.:
KFD_CAPABILITY_CHECK=1
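
The comparison rules described for these options can be modeled roughly as
follows. This is an illustrative sketch, not the plugin's implementation: the
property names (`fw_version`, `vram_size`, and so on) are hypothetical, and it
assumes that a check is disabled only when its variable is set to `0`, as in
the examples on this page. The NUMA check concerns GPU placement rather than a
per-property comparison, so it is not modeled here.

```python
import os

# Options whose check passes when the restore-side value is greater than or
# equal to the checkpoint-side value, mapped to a hypothetical property name.
MINIMUM_CHECKS = {
    "KFD_FW_VER_CHECK": "fw_version",
    "KFD_SDMA_FW_VER_CHECK": "sdma_fw_version",
    "KFD_CACHES_COUNT_CHECK": "caches_count",
    "KFD_NUM_GWS_CHECK": "num_gws",
    "KFD_VRAM_SIZE_CHECK": "vram_size",
}


def check_enabled(name, environ=None):
    """Checks default to enabled; assume only the value "0" disables one."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "1") != "0"


def failed_checks(checkpointed, restored, environ=None):
    """Return the option names of every enabled check that fails."""
    failures = []
    for option, prop in MINIMUM_CHECKS.items():
        if check_enabled(option, environ) and restored[prop] < checkpointed[prop]:
            failures.append(option)
    # The capability check requires equality rather than a minimum.
    if (check_enabled("KFD_CAPABILITY_CHECK", environ)
            and restored["capability"] != checkpointed["capability"]):
        failures.append("KFD_CAPABILITY_CHECK")
    return failures
```

For example, restoring onto a GPU with less VRAM than the checkpointed one
would be reported under `KFD_VRAM_SIZE_CHECK` unless that variable is set to
`0` in the environment passed to criu.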
AUTHOR
------
The AMDKFD team.
COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)