mirror of
https://github.com/checkpoint-restore/criu
synced 2025-08-22 01:51:51 +00:00
The amdgpu plugin would create a memory buffer at the size of the largest VRAM bo (buffer object). On some systems, VRAM size exceeds RAM size, so the largest bo might be larger than the available memory. Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the size of this buffer. By default, it is set to 0, and has no effect. When active, any bo larger than its value will be saved to/restored from file in multiple passes. Signed-off-by: David Francis <David.Francis@amd.com>
118 lines
3.2 KiB
Plaintext
ROCM Support(1)
===============

NAME
----
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.

CURRENT SUPPORT
---------------
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a Docker container
PyTorch
TensorFlow
Using CRIU Image Streamer

DESCRIPTION
-----------
Although *criu* can checkpoint and restore running applications, it has
limitations; for example, it cannot handle applications that have device
files open. To support *ROCm* based workloads, criu's core functionality
has to be augmented through its plugin based extension mechanism.
*criu-amdgpu-plugin* provides the support criu needs to allow
Checkpoint / Restore with ROCm.

Dependencies
~~~~~~~~~~~~
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of the amdkfd (amdgpu) driver. The kernel patches are
currently under review.

*criu 3.16*::
This work is rebased on the latest criu release available at the time of writing.

OPTIONS
-------
Optional parameters can be passed in as environment variables before
executing the criu command.

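As a minimal sketch of this mechanism, a `VAR=value` prefix sets an option for a single command only. The wrapper name `with_kfd_checks_relaxed` is hypothetical, and a stand-in `sh -c` command is used here in place of a real criu invocation:

```shell
# Sketch: scope plugin options to one command.
# with_kfd_checks_relaxed is a hypothetical helper; "$@" stands in for
# the real criu command line.
with_kfd_checks_relaxed() {
    # VAR=value prefixes are exported only to the command that follows.
    KFD_FW_VER_CHECK=0 KFD_SDMA_FW_VER_CHECK=0 "$@"
}

# Demonstration with a stand-in command instead of criu:
with_kfd_checks_relaxed sh -c 'echo "KFD_FW_VER_CHECK=$KFD_FW_VER_CHECK"'
```

In practice the same prefixes would go directly in front of the criu command line; the criu arguments themselves do not change.
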
*KFD_FW_VER_CHECK*::
Enable or disable firmware version check.
If enabled, the firmware version on the restored GPU needs to be greater than
or equal to the firmware version on the checkpointed GPU. Default: Enabled

E.g.:
KFD_FW_VER_CHECK=0

*KFD_SDMA_FW_VER_CHECK*::
Enable or disable SDMA firmware version check.
If enabled, the SDMA firmware version on the restored GPU needs to be greater
than or equal to the SDMA firmware version on the checkpointed GPU.
Default: Enabled

E.g.:
KFD_SDMA_FW_VER_CHECK=0

*KFD_CACHES_COUNT_CHECK*::
Enable or disable caches count check. If enabled, the caches count on the
restored GPU needs to be greater than or equal to the caches count on the
checkpointed GPU. Default: Enabled

E.g.:
KFD_CACHES_COUNT_CHECK=0

*KFD_NUM_GWS_CHECK*::
Enable or disable num_gws check. If enabled, the num_gws on the
restored GPU needs to be greater than or equal to the num_gws on the
checkpointed GPU. Default: Enabled

E.g.:
KFD_NUM_GWS_CHECK=0

*KFD_VRAM_SIZE_CHECK*::
Enable or disable VRAM size check. If enabled, the VRAM size on the
restored GPU needs to be greater than or equal to the VRAM size on the
checkpointed GPU. Default: Enabled

E.g.:
KFD_VRAM_SIZE_CHECK=0

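The firmware, SDMA firmware, caches count, num_gws and VRAM size checks above all follow the same "greater than or equal" rule. The sketch below illustrates that rule only; `kfd_check_ge` is a hypothetical helper and the numbers are made up, so it should not be read as the plugin's actual implementation:

```shell
# Sketch of the "greater than or equal" rule (hypothetical helper, not plugin code).
# usage: kfd_check_ge ENABLED RESTORED CHECKPOINTED
kfd_check_ge() (
    enabled=$1 restored=$2 checkpointed=$3
    # A disabled check (0) always passes.
    [ "$enabled" -eq 0 ] && exit 0
    # An enabled check passes only if the restore-side value is not smaller.
    [ "$restored" -ge "$checkpointed" ]
)

# Illustrative values only (for example, VRAM size in GiB):
kfd_check_ge 1 16 32 || echo "check enabled: restore refused, value too small"
kfd_check_ge 0 16 32 && echo "check disabled: restore allowed"
```

Setting one of the variables to 0 corresponds to passing 0 as the first argument here: the comparison is skipped for that property.
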
*KFD_NUMA_CHECK*::
Enable or disable NUMA CPU region check. If enabled, the plugin will restore
GPUs that belong to one CPU NUMA region to the same CPU NUMA region.
Default: Enabled

E.g.:
KFD_NUMA_CHECK=1

*KFD_CAPABILITY_CHECK*::
Enable or disable capability check. If enabled, the capability on the
restored GPU needs to be equal to the capability on the checkpointed GPU.
Default: Enabled

E.g.:
KFD_CAPABILITY_CHECK=1

*KFD_MAX_BUFFER_SIZE*::
On some systems, VRAM size may exceed RAM size, so the buffers used for dumping
and restoring VRAM may not fit in memory. Set to a nonzero value (in bytes)
to limit the plugin's memory usage. When set, any VRAM buffer object larger
than this value is saved to or restored from file in multiple passes.
Default: 0 (Disabled)

E.g.:
KFD_MAX_BUFFER_SIZE="2G"


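To give a feel for the effect of KFD_MAX_BUFFER_SIZE, the sketch below estimates how many passes a single buffer object would need under a given cap. Both helper names are hypothetical. The binary K/M/G suffix handling is an assumption inferred from the "2G" example, and the ceiling division is only one plausible way to split an object; the plugin's own parsing and chunking may differ:

```shell
# Sketch only: suffix handling is an assumption based on the "2G" example.
size_to_bytes() (
    value=$1
    case $value in
        *K) echo $(( ${value%K} * 1024 )) ;;
        *M) echo $(( ${value%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${value%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$value" ;;   # plain byte count
    esac
)

# Passes needed for one buffer object of BO_BYTES under a cap of MAX_BYTES.
# usage: passes_for_bo BO_BYTES MAX_BYTES
passes_for_bo() (
    bo_bytes=$1 max_bytes=$2
    # A cap of 0 means the limit is disabled: one pass for the whole object.
    [ "$max_bytes" -eq 0 ] && { echo 1; exit 0; }
    # Ceiling division: objects larger than the cap take several passes.
    echo $(( (bo_bytes + max_bytes - 1) / max_bytes ))
)

cap=$(size_to_bytes 2G)
passes_for_bo "$(size_to_bytes 5G)" "$cap"   # prints 3
```

Under these assumptions, a smaller cap lowers peak memory usage at the cost of more passes per large buffer object.
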
AUTHOR
------
The AMDKFD team.


COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)