2
0
mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 01:51:51 +00:00
criu/Documentation/criu-amdgpu-plugin.txt
David Francis d06c9b5cda criu/plugin: Add environment variable to cap size of buffers.
The amdgpu plugin would create a memory buffer at the size
of the largest VRAM bo (buffer object). On some systems, VRAM
size exceeds RAM size, so the largest bo might be larger than
the available memory.

Add an environment variable KFD_MAX_BUFFER_SIZE, which caps the
size of this buffer. By default, it is set to 0, and has no
effect. When active, any bo larger than its value will be
saved to/restored from file in multiple passes.

Signed-off-by: David Francis <David.Francis@amd.com>
2023-10-22 13:29:25 -07:00

118 lines
3.2 KiB
Plaintext

ROCM Support(1)
===============
NAME
----
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.
CURRENT SUPPORT
---------------
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on different system
Checkpoint / Restore inside a docker container
Pytorch
Tensorflow
Using CRIU Image Streamer
DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations such as it cannot handle
applications that have device files open. In order to support *ROCm* based
workloads with *criu* we need to augment criu's core functionality with a
plugin based extension mechanism. *criu-amdgpu-plugin* provides the necessary support
to criu to allow Checkpoint / Restore with ROCm.
Dependencies
~~~~~~~~~~~~~~
*amdkfd support*::
In order to snapshot the *VRAM* and other *GPU* device states, we require
an updated version of amdkfd(amdgpu) driver. The kernel patches are under
review currently.
*criu 3.16*::
This work is rebased on latest criu release available at this time.
OPTIONS
-------
Optional parameters can be passed in as environment variables before
executing criu command.
*KFD_FW_VER_CHECK*::
Enable or disable firmware version check.
If enabled, firmware version on restored gpu needs to be greater than or
equal firmware version on checkpointed GPU. Default:Enabled
E.g:
KFD_FW_VER_CHECK=0
*KFD_SDMA_FW_VER_CHECK*::
Enable or disable SDMA firmware version check.
If enabled, SDMA firmware version on restored gpu needs to be greater than or
equal firmware version on checkpointed GPU. Default:Enabled
E.g:
KFD_SDMA_FW_VER_CHECK=0
*KFD_CACHES_COUNT_CHECK*::
Enable or disable caches count check. If enabled, the caches count on
restored GPU needs to be greater than or equal caches count on checkpointed
GPU. Default:Enabled
E.g:
KFD_CACHES_COUNT_CHECK=0
*KFD_NUM_GWS_CHECK*::
Enable or disable num_gws check. If enabled, the num_gws on
restored GPU needs to be greater than or equal num_gws on checkpointed
GPU. Default:Enabled
E.g:
KFD_NUM_GWS_CHECK=0
*KFD_VRAM_SIZE_CHECK*::
Enable or disable VRAM size check. If enabled, the VRAM size on
restored GPU needs to be greater than or equal VRAM size on checkpointed
GPU. Default:Enabled
E.g:
KFD_VRAM_SIZE_CHECK=0
*KFD_NUMA_CHECK*::
Enable or disable NUMA CPU region check. If enabled, the plugin will restore
GPUs that belong to one CPU NUMA region to the same CPU NUMA region.
Default:Enabled
E.g:
KFD_NUMA_CHECK=1
*KFD_CAPABILITY_CHECK*::
Enable or disable capability check. If enabled, the capability on
restored GPU needs to be equal to the capability on the checkpointed GPU.
Default:Enabled
E.g:
KFD_CAPABILITY_CHECK=1
*KFD_MAX_BUFFER_SIZE*::
On some systems, VRAM sizes may exceed RAM sizes, and so buffers for dumping
and restoring VRAM may be unable to fit. Set to a nonzero value (in bytes)
to set a limit on the plugin's memory usage.
Default:0 (Disabled)
E.g:
KFD_MAX_BUFFER_SIZE="2G"
AUTHOR
------
The AMDKFD team.
COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)