mirror of https://github.com/checkpoint-restore/criu (synced 2025-08-22 01:51:51 +00:00)

plugins/amdgpu: Update README.md and criu-amdgpu-plugin.txt

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>

parent bfb4a3d842
commit 7c4bcdb2d4
@@ -15,6 +15,7 @@ Checkpoint / Restore inside a docker container
 Pytorch
 Tensorflow
 Using CRIU Image Streamer
+Parallel Restore
 
 DESCRIPTION
 -----------
@@ -3,7 +3,8 @@ Supporting ROCm with CRIU
 
 _Felix Kuehling <Felix.Kuehling@amd.com>_<br>
 _Rajneesh Bardwaj <Rajneesh.Bhardwaj@amd.com>_<br>
-_David Yat Sin <David.YatSin@amd.com>_
+_David Yat Sin <David.YatSin@amd.com>_<br>
+_Yanning Yang <yangyanning@sjtu.edu.cn>_
 
 # Introduction
 
@@ -224,6 +225,26 @@ to resume execution on the GPUs.
 *This new plugin is enabled by the new hook `__RESUME_DEVICES_LATE` in our RFC
 patch series.*
 
+## Restoring BO content in parallel
+
+Restoring the BO content is an important part in the restore of GPU state and
+usually takes a significant amount of time. A possible location for this
+procedure is the `cr_plugin_restore_file` hook. However, restoring in this hook
+blocks the target process from performing other restore operations, which
+hinders further optimization of the restore process.
+
+Therefore, a new plugin hook that runs in the master restore process is
+introduced, and it interacts with the `cr_plugin_restore_file` hook to complete
+the restore of BO content. Specifically, the target process only needs to send
+the relevant BOs to the master restore process, while this new hook handles all
+the restore of buffer objects. Through this method, during the restore of the BO
+content, the target process can perform other restore operations, thus
+accelerating the restore procedure. This is an implementation of the gCROP
+method proposed in the ACM SoCC'24 paper: [On-demand and Parallel
+Checkpoint/Restore for GPU Applications](https://dl.acm.org/doi/10.1145/3698038.3698510).
+
+*This optimization technique is enabled by the `__POST_FORKING` hook.*
+
 ## Other CRIU changes
 
 In addition to the new plugins, we need to make some changes to CRIU itself to
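
Note on the added "Restoring BO content in parallel" section: the hand-off it describes can be modeled roughly as below. This is a conceptual sketch in Python, not CRIU or amdgpu plugin code. It assumes a thread stands in for the master restore process and a queue stands in for whatever channel the target process uses to send BOs. All names (`restore_bo_content`, `master_restore_worker`, `parallel_restore`) are hypothetical and exist only for illustration.

```python
import queue
import threading


def restore_bo_content(bo_id, payload, gpu_memory):
    """Stand-in for copying one buffer object's saved content back to the GPU."""
    gpu_memory[bo_id] = bytes(payload)


def master_restore_worker(requests, gpu_memory):
    """Models the hook in the master restore process: drain BO requests."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: the target has no more BOs to send
            break
        bo_id, payload = item
        restore_bo_content(bo_id, payload, gpu_memory)


def parallel_restore(bos, other_steps):
    """Send BOs to the worker, then run the target's other restore steps meanwhile."""
    requests = queue.Queue()
    gpu_memory = {}
    worker = threading.Thread(
        target=master_restore_worker, args=(requests, gpu_memory)
    )
    worker.start()

    # The target only hands over the BOs; it does not copy their content itself.
    for bo_id, payload in bos.items():
        requests.put((bo_id, payload))
    requests.put(None)

    # These steps overlap with the BO content restore running in the worker.
    results = [step() for step in other_steps]

    # All BO content must be in place before the restored task may resume.
    worker.join()
    return gpu_memory, results
```

The point of the split is that the slow copy no longer sits on the target's critical path: the target pays only for enqueueing the BOs, and the final `join` marks the point where both sides must have finished.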