mir/criu - criu - Mike's Git repositories

mirror of https://github.com/checkpoint-restore/criu synced 2025-08-28 12:57:57 +00:00

Author	SHA1	Message	Date
Ramesh Errabolu	733ef96315	amdgpu_plugin: Refactor code in preparation to support C&R for DRM devices. Add a new compilation unit to host symbols and methods that will be needed to C&R DRM devices. Refactor code that indicates support for C&R and checkpoints KFD and DRM devices. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>	2024-09-11 16:02:11 -07:00
David Yat Sin	55370b720e	criu/plugin: Store BO contents directly to file. Store BO contents directly to file (1 per GPU) instead of using protobuf. Bug fix: fixes an issue where we could not handle BOs bigger than 4 GB because protobuf has an internal limit of 4 GB for the Bytes structure. Performance improvements: this significantly reduces CR duration on multi-GPU systems, as it allows reading and writing to disk in parallel. During checkpoint, instead of waiting for all the BO contents to be collected for the one protobuf file, we can now start writing BO contents to disk as soon as the first BO is read from VRAM. During restore, we can start writing BO contents to VRAM as soon as the first BO is read from disk. This also reduces the peak amount of system memory used, as we only need to keep 1 BO content in memory per GPU at a time instead of all the BO contents. Signed-off-by: David Yat Sin <david.yatsin@amd.com>	2022-04-28 17:53:52 -07:00
David Yat Sin	6d79266229	criu/plugin: Restore libhsakmt shared memory files. Libhsakmt (thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem and its semaphore in /dev/shm/sem.hsakmt_semaphore. Add a check during checkpoint to see if these two files exist. If they exist, the plugin will try to restore them during restore. Signed-off-by: David Yat Sin <david.yatsin@amd.com>	2022-04-28 17:53:52 -07:00
David Yat Sin	72905c9c9b	criu/plugin: Remap GPUs on checkpoint restore. The device topology on the restore node can be different from the topology on the checkpointed node. The GPUs on the restore node may have different gpu_ids or minor numbers, or some GPUs may have different properties than those on the checkpointed node. During restore, the CRIU plugin determines the target GPUs to avoid restore failures caused by trying to restore a process on a GPU that is different. Signed-off-by: David Yat Sin <david.yatsin@amd.com>	2022-04-28 17:53:52 -07:00
David Yat Sin	6e99fea2fa	criu/plugin: Implement system topology parsing. Parse the local system topology in /sys/class/kfd/kfd/topology/nodes/ and store the properties of each GPU in the CRIU image files. The GPU properties can then be used later during restore to make sure the process is restored on GPUs with similar properties. Signed-off-by: David Yat Sin <david.yatsin@amd.com>	2022-04-28 17:53:52 -07:00
Rajneesh Bhardwaj	55a5993bc7	criu/plugin: Support AMD ROCm Checkpoint Restore with KFD. To support Checkpoint Restore with AMD GPUs for ROCm workloads, introduce a new plugin to assist CRIU with the help of the AMD KFD kernel driver. This initial commit just provides the basic framework to build up further capabilities. Like CRIU, the amdgpu plugin also uses protobuf to serialize and save the amdkfd data, which is mostly VRAM contents with some metadata. We generate a data file "amdgpu-kfd-<id>.img" during the dump stage. On restore, this file is read and extracted to re-create the various types of buffer objects that belonged to the previously checkpointed process. Upon restore, the mmap page offset within a device file might change, so we use the new hook to update and adjust the mmap offsets for the newly created target process. This is needed for the sys_mmap call in the pie restorer phase. Support for queues and events is added in future patches of this series. With the current implementation (amdgpu_plugin), we support: compute workloads only (non-Gfx); GPUs visible inside a container; the AMD GPU Gfx 9 family; PyTorch benchmarks such as BERT Base. The amdgpu plugin depends on libdrm and libdrm_amdgpu, which are typically installed with the libdrm-dev package. We build amdgpu_plugin only when the dependencies are met on the target system and when the user intends to install the amdgpu plugin, not by default with the criu build. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Co-authored-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>	2022-04-28 17:53:52 -07:00

6 Commits
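
Commits 6e99fea2fa and 72905c9c9b describe parsing the KFD topology under /sys/class/kfd/kfd/topology/nodes/ and then choosing target GPUs on the restore node. The sketch below illustrates that idea only; it is not the plugin's code, which is written in C and is not shown on this page. It assumes each node directory holds a `properties` file of whitespace-separated `key value` lines. The function names, the greedy matching strategy, and the choice of which properties to compare are all hypothetical.

```python
from pathlib import Path

# Directory named in the commit message for 6e99fea2fa.
TOPOLOGY_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text):
    """Parse 'key value' lines (assumed format of a node's properties file)."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        try:
            props[key] = int(value)
        except ValueError:
            props[key] = value  # keep non-numeric values as strings
    return props


def read_local_topology(root=TOPOLOGY_NODES):
    """Return {node directory name: properties}; empty if root is absent."""
    nodes = {}
    if not root.is_dir():
        return nodes
    for node_dir in sorted(root.iterdir()):
        prop_file = node_dir / "properties"
        if prop_file.is_file():
            nodes[node_dir.name] = parse_properties(prop_file.read_text())
    return nodes


def remap_gpus(saved, local, keys):
    """Hypothetical greedy remap: pair each checkpointed GPU with an unused
    local GPU whose values for `keys` are equal.

    Returns {saved_id: local_id}, or None if any GPU has no match.
    """
    free = dict(local)
    mapping = {}
    for saved_id, saved_props in saved.items():
        match = next(
            (local_id for local_id, local_props in free.items()
             if all(local_props.get(k) == saved_props.get(k) for k in keys)),
            None,
        )
        if match is None:
            return None
        mapping[saved_id] = match
        del free[match]
    return mapping
```

With made-up values, `remap_gpus({"a": {"gfx_target_version": 90010}}, {"x": {"gfx_target_version": 90010}}, ["gfx_target_version"])` pairs `"a"` with `"x"`. A real remap would likely compare many more properties than this, and returning `None` here stands in for the restore failure that the commit message says the plugin tries to avoid.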