Description
Hi, I have a KVM guest with an AMD Instinct MI300X VF GPU. When I restore a PyTorch example workload running in a podman container, the workload hangs after the restore. Both the checkpoint and restore logs report success.
Setup
systemd-detect-virt
kvm
dmidecode -s system-product-name
Standard PC (Q35 + ICH9, 2009)
rocminfo
ROCk module version 6.12.12 is loaded
cat /opt/rocm/.info/version
6.4.0-47
lspci | grep AMD
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
CRIU version: v4.1
OS:
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Commands to reproduce
apt install -y podman criu runc
Start the workload (using the container image mentioned in the official ROCm docs: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html):
git clone https://github.com/pytorch/examples.git
podman run -it --runtime runc --log-driver k8s-file -v $(pwd):$(pwd) --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video docker.io/rocm/dev-ubuntu-22.04 python3 /home/tomz/examples/mnist/main.py
Once the workload is running, checkpoint it:
rocm-smi --showpids
============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID    PROCESS NAME  GPU(s)  VRAM USED   SDMA USED  CU OCCUPANCY
23614  python3       1       1629364224  0          0
==========================================================================================
================================== End of ROCm SMI Log ===================================
podman container checkpoint <id> -k
Restore and check usage:
podman container restore <id> -k
rocm-smi --showpids
============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID    PROCESS NAME  GPU(s)  VRAM USED   SDMA USED  CU OCCUPANCY
24118  python3       1       2036998144  0          0
==========================================================================================
================================== End of ROCm SMI Log ===================================
Note that VRAM usage increases after the restore, from 1629364224 to 2036998144. After the restore, the workload process hangs indefinitely.
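To put a number on the increase, the two `VRAM USED` values from the `rocm-smi --showpids` outputs can be compared directly. The following is a minimal Python sketch, not part of any ROCm tooling. The helper name `vram_used_bytes` is hypothetical, and it assumes that the value is the fourth whitespace-separated field of the process row and that it is reported in bytes.

```python
def vram_used_bytes(row: str) -> int:
    """Return the VRAM USED field from one `rocm-smi --showpids` process row.

    Assumes the column order PID, PROCESS NAME, GPU(s), VRAM USED, ... and a
    process name without spaces, as in the output shown in this report.
    """
    fields = row.split()
    return int(fields[3])


# Process rows copied from the two rocm-smi outputs in this report.
before = vram_used_bytes("23614 python3 1 1629364224 0 0")
after = vram_used_bytes("24118 python3 1 2036998144 0 0")

delta = after - before
print(f"VRAM delta: {delta} bytes ({delta / 2**20:.2f} MiB)")
```

Under those assumptions, the restored process holds 407633920 bytes (388.75 MiB) more VRAM than it did before the checkpoint.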
CRIU logs related to amdgpu_plugin
Dump log:
29:(00.037459) amdgpu_plugin: initialized: amdgpu_plugin (AMDGPU/KFD)
30:(00.037467) amdgpu_plugin: Value str: (null)
31:(00.037468) amdgpu_plugin: param: KFD_MAX_BUFFER_SIZE:0x0
1862:(00.096418) amdgpu_plugin: amdgpu_plugin_dump_file() called for fd = 237
1863:(00.096532) amdgpu_plugin: devices:1 bos:159 objects:86 priv_data:82504
1864:(00.096759) amdgpu_plugin: Dumped devices Ok (ret:0)
1865:(00.096965) amdgpu_plugin: Thread[0x5759] started
1866:(01.135432) amdgpu_plugin: Thread[0x5759] done num_bos:74 ret:0
1867:(01.139960) amdgpu_plugin: Thread[0x5759] finished ret:0
1868:(01.139999) amdgpu_plugin: Dumped bos ok (ret:0)
1869:(01.140015) amdgpu_plugin: img_path = amdgpu-kfd-234.img
1870:(01.140028) amdgpu_plugin: Len = 87520
1871:(01.140190) amdgpu_plugin: Dump successful
1880:(01.140584) amdgpu_plugin: Process unpaused Ok (ret:0)
7587:(02.867906) amdgpu_plugin: finished amdgpu_plugin (AMDGPU/KFD)
Restore log:
13:(00.001087) amdgpu_plugin: initialized: amdgpu_plugin (AMDGPU/KFD)
14:(00.001093) amdgpu_plugin: param: KFD_FW_VER_CHECK:Y
15:(00.001094) amdgpu_plugin: param: KFD_SDMA_FW_VER_CHECK:Y
16:(00.001096) amdgpu_plugin: param: KFD_CACHES_COUNT_CHECK:Y
17:(00.001097) amdgpu_plugin: param: KFD_NUM_GWS_CHECK:Y
18:(00.001098) amdgpu_plugin: param: KFD_VRAM_SIZE_CHECK:Y
19:(00.001099) amdgpu_plugin: param: KFD_NUMA_CHECK:Y
20:(00.001101) amdgpu_plugin: param: KFD_CAPABILITY_CHECK:Y
21:(00.001103) amdgpu_plugin: Value str: (null)
22:(00.001104) amdgpu_plugin: param: KFD_MAX_BUFFER_SIZE:0x0
1121:(02.850329) 1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 234
1178:(02.850565) 1: amdgpu_plugin: Opened kfd, fd = 4
1179:(02.850574) 1: amdgpu_plugin: KFD Image file size:87520
1180:(02.850873) 1: amdgpu_plugin: ===System Topology=[ Source ]==================================
1181:(02.850875) 1: amdgpu_plugin: [1] GPU gpu_id:0x5759
1182:(02.850877) 1: amdgpu_plugin: vendor_id:4098 device_id:29877
1183:(02.850878) 1: amdgpu_plugin: vram_public:Y vram_size:205822885888
1184:(02.850880) 1: amdgpu_plugin: io_links_count:1 capability:2893521536
1185:(02.850881) 1: amdgpu_plugin: mem_banks_count:1 caches_count:626 lds_size_in_kb:64
1186:(02.850883) 1: amdgpu_plugin: simd_count:1216 max_waves_per_simd:8
1187:(02.850884) 1: amdgpu_plugin: num_gws:64 wave_front_size:64 array_count:32
1188:(02.850885) 1: amdgpu_plugin: simd_arrays_per_engine:1 simd_per_cu:4
1189:(02.850887) 1: amdgpu_plugin: max_slots_scratch_cu:32 cu_per_simd_array:10
1190:(02.850888) 1: amdgpu_plugin: num_sdma_engines:2
1191:(02.850889) 1: amdgpu_plugin: num_sdma_xgmi_engines:14 num_sdma_queues_per_engine:8
1192:(02.850890) 1: amdgpu_plugin: num_cp_queues:30 fw_version:32945 sdma_fw_version:24
1193:(02.850892) 1: amdgpu_plugin: iolink type:PCIe node-to:0 (0x0000) node-from:1 bi-dir:Y
1194:(02.850894) 1: amdgpu_plugin: [0] CPU
1195:(02.850895) 1: amdgpu_plugin: cpu_cores_count:13
1196:(02.850896) 1: amdgpu_plugin: iolink type:PCIe node-to:1 (0x5759) node-from:0 bi-dir:Y
1197:(02.850897) 1: amdgpu_plugin: ===Groups==========================================================
1198:(02.850899) 1: amdgpu_plugin: ===================================================================
1199:(02.850900) 1: amdgpu_plugin: ===System Topology=[ Destination]==================================
1200:(02.850911) 1: amdgpu_plugin: [1] GPU gpu_id:0x5759
1201:(02.850913) 1: amdgpu_plugin: vendor_id:4098 device_id:29877
1202:(02.850914) 1: amdgpu_plugin: vram_public:Y vram_size:205822885888
1203:(02.850915) 1: amdgpu_plugin: io_links_count:1 capability:2893521536
1204:(02.850916) 1: amdgpu_plugin: mem_banks_count:1 caches_count:626 lds_size_in_kb:64
1205:(02.850918) 1: amdgpu_plugin: simd_count:1216 max_waves_per_simd:8
1206:(02.850919) 1: amdgpu_plugin: num_gws:64 wave_front_size:64 array_count:32
1207:(02.850920) 1: amdgpu_plugin: simd_arrays_per_engine:1 simd_per_cu:4
1208:(02.850921) 1: amdgpu_plugin: max_slots_scratch_cu:32 cu_per_simd_array:10
1209:(02.850923) 1: amdgpu_plugin: num_sdma_engines:2
1210:(02.850924) 1: amdgpu_plugin: num_sdma_xgmi_engines:14 num_sdma_queues_per_engine:8
1211:(02.850925) 1: amdgpu_plugin: num_cp_queues:30 fw_version:32945 sdma_fw_version:24
1212:(02.850926) 1: amdgpu_plugin: iolink type:PCIe node-to:0 (0x0000) node-from:1 bi-dir:Y
1213:(02.850927) 1: amdgpu_plugin: [0] CPU
1214:(02.850929) 1: amdgpu_plugin: cpu_cores_count:13
1215:(02.850930) 1: amdgpu_plugin: iolink type:PCIe node-to:1 (0x5759) node-from:0 bi-dir:Y
1216:(02.850931) 1: amdgpu_plugin: ===Groups==========================================================
1217:(02.850932) 1: amdgpu_plugin: ===================================================================
1218:(02.850937) 1: amdgpu_plugin: ===Maps===============
1219:(02.850939) 1: amdgpu_plugin: GPU: 0x5759 -> 0x5759
1220:(02.850940) 1: amdgpu_plugin: CPU: 00 -> 00
1221:(02.850941) 1: amdgpu_plugin: ======================
1222:(02.850943) 1: amdgpu_plugin: Maps after all nodes matched
1223:(02.850944) 1: amdgpu_plugin: ===Maps===============
1224:(02.850945) 1: amdgpu_plugin: GPU: 0x5759 -> 0x5759
1225:(02.850946) 1: amdgpu_plugin: CPU: 00 -> 00
1226:(02.850948) 1: amdgpu_plugin: ======================
1227:(02.851165) 1: amdgpu_plugin: passing drm render fd = 35 to driver
1228:(02.851167) 1: amdgpu_plugin: Restore devices Ok (ret:0)
1229:(02.851174) 1: amdgpu_plugin: Restore BOs Ok
1230:(02.854044) 1: amdgpu_plugin: Thread[0x5759] started
1231:(03.183827) 1: amdgpu_plugin: Thread[0x5759] done num_bos:74 ret:0
1232:(03.187578) 1: amdgpu_plugin: Thread[0x5759] finished ret:0
1233:(03.187637) 1: amdgpu_plugin: Restore successful (fd:4)
1244:(03.187767) 1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 238
1250:(03.187782) 1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-238.img: Failed to open for read
1252:(03.187787) 1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-238.img
1260:(03.187802) 1: amdgpu_plugin: render node gpu_id = 0x5759
1262:(03.187804) 1: amdgpu_plugin: render node destination gpu_id = 0x5759
1331:(03.187925) 1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 247
1334:(03.187936) 1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-247.img: Failed to open for read
1336:(03.187939) 1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-247.img
1341:(03.187946) 1: amdgpu_plugin: render node gpu_id = 0x5759
1343:(03.187948) 1: amdgpu_plugin: render node destination gpu_id = 0x5759
3944:(03.323798) amdgpu_plugin: Inside amdgpu_plugin_resume_devices_late for target pid = 24118
3945:(03.323847) amdgpu_plugin: Calling IOCTL to start notifiers and queues
3946:(03.437808) amdgpu_plugin: Inside amdgpu_plugin_resume_devices_late for target pid = 24141
3947:(03.437859) amdgpu_plugin: Calling IOCTL to start notifiers and queues
3948:(03.437868) amdgpu_plugin: Pid 24141 has no kfd process info
3953:(03.438717) amdgpu_plugin: finished amdgpu_plugin (AMDGPU/KFD)
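The only `Error` lines in the restore excerpt are the two failed opens of `amdgpu-kfd-238.img` and `amdgpu-kfd-247.img`, each followed by a `Restoring RenderD` line for the same ID. Whether they are related to the hang is not established here. A small Python sketch for pulling such lines out of a CRIU log is shown below. The function name `plugin_errors` and the regular expression are illustrative assumptions based on the line format in this excerpt, not a CRIU-provided tool.

```python
import re

# Matches lines shaped like the ones in this excerpt, for example:
#   "1250:(03.187782) 1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: <message>"
# Anything before the "(timestamp)" is ignored; the "1: " prefix after it is optional.
ERROR_RE = re.compile(
    r"\((?P<time>\d+\.\d+)\)\s+(?:\d+:\s+)?"
    r"Error \((?P<src>[^)]+)\): amdgpu_plugin: (?P<msg>.*)"
)


def plugin_errors(lines):
    """Yield (timestamp, source location, message) for amdgpu_plugin error lines."""
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            yield m.group("time"), m.group("src"), m.group("msg").strip()


# Sample lines copied from the restore log in this report.
sample = [
    "1244:(03.187767) 1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 238",
    "1250:(03.187782) 1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-238.img: Failed to open for read",
    "1252:(03.187787) 1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-238.img",
    "1334:(03.187936) 1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-247.img: Failed to open for read",
]

errors = list(plugin_errors(sample))
for ts, src, msg in errors:
    print(ts, src, msg)
```

The same generator can be fed an open log file object instead of the `sample` list, since it only iterates over lines.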
Full CRIU log
dump.log
Please let me know if you need any more info.