mirror of
https://github.com/checkpoint-restore/criu
synced 2025-08-22 01:51:51 +00:00
By default, the file name 'amdgpu_plugin.txt' is used also as the name for the corresponding man page (`man amdgpu_plugin`). However, when this man page is installed system-wide it would be more appropriate to have a prefix 'criu-' (e.g., `man criu-amdgpu-plugin`). Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
109 lines
2.9 KiB
Plaintext
ROCM Support(1)
===============

NAME
----
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.

CURRENT SUPPORT
---------------
* Single and multi-GPU systems (Gfx9)
* Checkpoint / Restore on a different system
* Checkpoint / Restore inside a Docker container
* PyTorch
* TensorFlow
* Using CRIU Image Streamer

DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations; for example, it cannot handle
applications that have device files open. To support *ROCm*-based
workloads, criu's core functionality has to be augmented through its
plugin-based extension mechanism. *criu-amdgpu-plugin* provides the support
criu needs to allow Checkpoint / Restore with ROCm.

Dependencies
~~~~~~~~~~~~
*amdkfd support*::
    In order to snapshot the *VRAM* and other *GPU* device states, an updated
    version of the amdkfd (amdgpu) driver is required. The kernel patches are
    currently under review.

*criu 3.16*::
    This work is rebased on criu 3.16, the latest criu release available at
    the time of writing.

OPTIONS
-------
Optional parameters can be passed in as environment variables before
executing the criu command.

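As an illustration of passing these options, the following Python sketch builds an environment with selected checks switched on or off and hands it to `criu restore`. The helper names (`criu_env`, `restore_command`, `run_restore`) are hypothetical and are not part of criu or the plugin; only the `criu restore --images-dir` invocation and the variable names documented on this page are taken as given.

```python
import os
import subprocess


def criu_env(overrides, base=None):
    """Return a copy of the environment with plugin options applied.

    `overrides` maps option names from this page (for example
    "KFD_FW_VER_CHECK") to True/False, written out as "1"/"0".
    """
    env = dict(os.environ if base is None else base)
    for name, enabled in overrides.items():
        env[name] = "1" if enabled else "0"
    return env


def restore_command(images_dir):
    """Build a minimal `criu restore` command line for `images_dir`."""
    return ["criu", "restore", "--images-dir", images_dir]


def run_restore(images_dir, overrides):
    """Run criu with the given overrides (hypothetical wrapper, needs root)."""
    return subprocess.run(
        restore_command(images_dir), env=criu_env(overrides), check=True
    )
```

Calling `run_restore(images_dir, {"KFD_FW_VER_CHECK": False})` would be roughly equivalent to prefixing the criu command with `KFD_FW_VER_CHECK=0` in a shell; a real restore may need further criu options depending on the workload.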
*KFD_FW_VER_CHECK*::
    Enable or disable the firmware version check.
    If enabled, the firmware version on the restored GPU needs to be greater
    than or equal to the firmware version on the checkpointed GPU.
    Default: Enabled

    E.g.:
    KFD_FW_VER_CHECK=0

*KFD_SDMA_FW_VER_CHECK*::
    Enable or disable the SDMA firmware version check.
    If enabled, the SDMA firmware version on the restored GPU needs to be
    greater than or equal to the SDMA firmware version on the checkpointed GPU.
    Default: Enabled

    E.g.:
    KFD_SDMA_FW_VER_CHECK=0

*KFD_CACHES_COUNT_CHECK*::
    Enable or disable the caches count check. If enabled, the caches count on
    the restored GPU needs to be greater than or equal to the caches count on
    the checkpointed GPU. Default: Enabled

    E.g.:
    KFD_CACHES_COUNT_CHECK=0

*KFD_NUM_GWS_CHECK*::
    Enable or disable the num_gws check. If enabled, the num_gws on the
    restored GPU needs to be greater than or equal to the num_gws on the
    checkpointed GPU. Default: Enabled

    E.g.:
    KFD_NUM_GWS_CHECK=0

*KFD_VRAM_SIZE_CHECK*::
    Enable or disable the VRAM size check. If enabled, the VRAM size on the
    restored GPU needs to be greater than or equal to the VRAM size on the
    checkpointed GPU. Default: Enabled

    E.g.:
    KFD_VRAM_SIZE_CHECK=0

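The firmware, SDMA firmware, caches count, num_gws and VRAM size checks all follow the same rule: the value on the restored GPU must be greater than or equal to the value on the checkpointed GPU, unless the check is disabled. The Python sketch below models that rule only. The field names (`fw_version`, `vram_size`, and so on) and the function names are assumptions made for illustration, and the plugin itself is not implemented this way.

```python
import os

# Hypothetical field names; the plugin's real data structures are not
# described on this page.
MIN_CHECKS = {
    "KFD_FW_VER_CHECK": "fw_version",
    "KFD_SDMA_FW_VER_CHECK": "sdma_fw_version",
    "KFD_CACHES_COUNT_CHECK": "caches_count",
    "KFD_NUM_GWS_CHECK": "num_gws",
    "KFD_VRAM_SIZE_CHECK": "vram_size",
}


def check_enabled(name, env=None):
    """Treat a check as enabled (the default) unless its variable is "0"."""
    env = os.environ if env is None else env
    return env.get(name, "1") != "0"


def failed_checks(checkpointed, restored, env=None):
    """Return the names of enabled checks that the restored GPU fails."""
    failed = []
    for var, field in MIN_CHECKS.items():
        # Each check passes when restored >= checkpointed.
        if check_enabled(var, env) and restored[field] < checkpointed[field]:
            failed.append(var)
    return failed
```

With this model, a restored GPU with older firmware but more VRAM fails only `KFD_FW_VER_CHECK`, and setting `KFD_FW_VER_CHECK=0` makes the same pair of GPUs acceptable.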
*KFD_NUMA_CHECK*::
    Enable or disable the NUMA CPU region check. If enabled, the plugin will
    restore GPUs that belong to one CPU NUMA region to the same CPU NUMA region.
    Default: Enabled

    E.g.:
    KFD_NUMA_CHECK=1

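One way to read the NUMA rule is that GPUs which shared a CPU NUMA region at checkpoint time must again share a single region after restore. The Python sketch below checks a candidate GPU mapping against that reading. The function name, the dictionary layout and the interpretation itself are assumptions for illustration, not a description of the plugin's actual algorithm.

```python
def preserves_numa_grouping(src_numa, dst_numa, mapping):
    """Check that a GPU mapping keeps NUMA groups together.

    src_numa: checkpointed GPU id -> NUMA node at checkpoint time
    dst_numa: restored GPU id -> NUMA node on the restore system
    mapping:  checkpointed GPU id -> restored GPU id (candidate assignment)
    """
    node_map = {}  # checkpoint-side NUMA node -> restore-side NUMA node
    for src_gpu, dst_gpu in mapping.items():
        src_node = src_numa[src_gpu]
        dst_node = dst_numa[dst_gpu]
        # The first GPU of a group fixes the target node; later GPUs of
        # the same group must land on that same node.
        if node_map.setdefault(src_node, dst_node) != dst_node:
            return False
    return True
```

For two checkpointed GPUs on NUMA node 0, a mapping that places both on node 1 of the restore system passes, while a mapping that splits them across nodes 0 and 1 does not.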
*KFD_CAPABILITY_CHECK*::
    Enable or disable the capability check. If enabled, the capability on the
    restored GPU needs to be equal to the capability on the checkpointed GPU.
    Default: Enabled

    E.g.:
    KFD_CAPABILITY_CHECK=1

AUTHOR
------
The AMDKFD team.

COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)