ROCM Support(1)
===============

NAME
----
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.

CURRENT SUPPORT
---------------
* Single and multi GPU systems (Gfx9)
* Checkpoint / Restore on a different system
* Checkpoint / Restore inside a Docker container
* PyTorch

DESCRIPTION
-----------
Though *criu* is a great tool for checkpointing and restoring running
applications, it has certain limitations; for example, it cannot handle
applications that have device files open. To support *ROCm* based
workloads with *criu*, we need to augment criu's core functionality through
a plugin based extension mechanism. *amdgpu_plugin* provides the support
criu needs to allow Checkpoint / Restore with ROCm.

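To illustrate the "device files open" limitation, the following is a minimal
sketch that lists GPU device files a process currently holds open by reading
its `/proc/<pid>/fd` links. It is not part of criu or amdgpu_plugin. The
helper names and the choice of path prefixes (`/dev/kfd` for the amdkfd
compute node, `/dev/dri/renderD*` for DRM render nodes) are assumptions made
for this example.

```python
import os

# Assumption: these prefixes identify AMD GPU device nodes.
GPU_DEVICE_PREFIXES = ("/dev/kfd", "/dev/dri/renderD")


def is_gpu_device_path(path):
    """Return True if path looks like a GPU device node (by prefix only)."""
    return path.startswith(GPU_DEVICE_PREFIXES)


def open_gpu_device_files(pid, proc_root="/proc"):
    """Return sorted (fd, target) pairs for GPU device files open in pid."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        # Process is gone, not readable, or /proc is unavailable.
        return found
    for name in entries:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed while we were scanning
        if is_gpu_device_path(target):
            found.append((int(name), target))
    return sorted(found)


if __name__ == "__main__":
    import sys

    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, target in open_gpu_device_files(pid):
        print(f"fd {fd} -> {target}")
```

A process for which this returns a non-empty list is the kind of workload
that stock criu cannot handle on its own and that the plugin is meant to
cover.
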
Dependencies
~~~~~~~~~~~~
*amdkfd support*::
    In order to snapshot the *VRAM* and other *GPU* device states, we require
    an updated version of the amdkfd (amdgpu) driver. The kernel patches are
    currently under review.

*criu 3.16*::
    This work is rebased on criu 3.16, the latest criu release available at
    the time of writing.

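As a rough way to check the criu dependency, the sketch below parses the
text printed by `criu --version` and compares it against 3.16. It assumes
the output contains a line of the form `Version: X.Y`, and it treats 3.16 as
a lower bound, which is an interpretation for this example rather than a
documented requirement. The function names are hypothetical.

```python
import re
import subprocess

# Assumption: 3.16 (the release this work is rebased on) is used as a floor.
MIN_CRIU_VERSION = (3, 16)


def parse_criu_version(text):
    """Extract (major, minor) from text such as 'Version: 3.16.1'."""
    match = re.search(r"Version:\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text, minimum=MIN_CRIU_VERSION):
    """Return True if the parsed version is at least the given minimum."""
    return parse_criu_version(text) >= minimum


if __name__ == "__main__":
    # Requires criu to be installed and on PATH.
    output = subprocess.run(
        ["criu", "--version"], capture_output=True, text=True, check=True
    ).stdout
    print("criu version ok" if meets_minimum(output) else "criu too old")
```
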
AUTHOR
------
The AMDKFD team.

COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)