AMD GPU restore success but process hang forever · Issue #2663 · checkpoint-restore/criu · GitHub
AMD GPU restore success but process hang forever #2663
Open
@Tianyang-Zhang

Description

Hi, I have a KVM guest with an AMD Instinct MI300X VF GPU. When I try to restore a PyTorch example workload running in a podman container, the workload hangs after restore. Both the checkpoint and restore logs show success.

Setup

systemd-detect-virt
kvm

dmidecode -s system-product-name
Standard PC (Q35 + ICH9, 2009)

rocminfo
ROCk module version 6.12.12 is loaded

cat /opt/rocm/.info/version
6.4.0-47

lspci | grep AMD
05:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]

CRIU version: v4.1

OS:

cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Commands to reproduce

apt install -y podman criu runc

Start the workload (using the container image mentioned in the official ROCm documentation: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html):

git clone https://github.com/pytorch/examples.git

podman run -it --runtime runc --log-driver k8s-file -v $(pwd):$(pwd) --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video docker.io/rocm/dev-ubuntu-22.04 python3 /home/tomz/examples/mnist/main.py

Once the workload is running, checkpoint it:

rocm-smi --showpids
============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
23614   python3         1       1629364224      0               0
==========================================================================================
================================== End of ROCm SMI Log ===================================

podman container checkpoint <id> -k

Restore and check usage

podman container restore <id> -k

rocm-smi --showpids
============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
24118   python3         1       2036998144      0               0
==========================================================================================
================================== End of ROCm SMI Log ===================================
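To quantify the difference between the two `rocm-smi --showpids` outputs above, a small parser can help. This is a minimal sketch, not part of the original report: the function name `parse_showpids` is hypothetical, it assumes the column layout shown above, and it assumes process names contain no spaces.

```python
def parse_showpids(text):
    """Return {pid: vram_used} parsed from `rocm-smi --showpids` text output.

    Assumes data rows have the six columns shown in this issue:
    PID, PROCESS NAME, GPU(s), VRAM USED, SDMA USED, CU OCCUPANCY,
    and that the process name contains no whitespace.
    """
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Header and separator rows do not start with a numeric PID.
        if len(fields) >= 6 and fields[0].isdigit() and fields[3].isdigit():
            usage[int(fields[0])] = int(fields[3])
    return usage


# Data rows copied from the two outputs in this issue.
BEFORE = """\
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
23614   python3         1       1629364224      0               0
"""
AFTER = """\
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
24118   python3         1       2036998144      0               0
"""

before_total = sum(parse_showpids(BEFORE).values())
after_total = sum(parse_showpids(AFTER).values())
delta = after_total - before_total
print(f"VRAM USED before: {before_total}, after: {after_total}, delta: {delta}")
```

The PID differs between the two runs (23614 before, 24118 after), so the sketch compares totals rather than matching rows by PID.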

Note that the reported VRAM usage is higher after restore (1629364224 before, 2036998144 after). After restore, the workload process hangs forever.
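To characterize the hang more precisely, it may help to record the scheduler state of each thread of the restored process. The following is a rough sketch, not something that was run for this report: the helper names are hypothetical, it relies only on the standard Linux `/proc/<pid>/task/<tid>/status` and `wchan` files, and it assumes it runs in a PID namespace where the restored PID is visible, with enough privileges to read those files.

```python
from pathlib import Path


def parse_status(text):
    """Parse the 'Key:<tab>value' lines of a /proc status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def thread_states(pid):
    """Return a list of (tid, name, state, wchan) for every thread of `pid`."""
    rows = []
    for task in sorted(Path("/proc", str(pid), "task").iterdir()):
        try:
            status = parse_status((task / "status").read_text())
        except OSError:
            continue  # thread exited while we were iterating
        try:
            wchan = (task / "wchan").read_text().strip()
        except OSError:
            wchan = "?"  # not readable without sufficient privileges
        rows.append((int(task.name), status.get("Name", "?"),
                     status.get("State", "?"), wchan))
    return rows


def report(pid):
    """Print one line per thread: tid, name, state, kernel wait channel."""
    for tid, name, state, wchan in thread_states(pid):
        print(f"{tid:>8}  {name:<16} {state:<20} {wchan}")
```

For example, `report(24118)` would target the PID that `rocm-smi` lists after restore. Threads in state `D (disk sleep)` would suggest blocking inside the kernel, while `S (sleeping)` would point more toward a userspace wait; either reading is only a starting point for further investigation.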

CRIU logs related to amdgpu_plugin
Dump log:

29:(00.037459) amdgpu_plugin: initialized:  amdgpu_plugin (AMDGPU/KFD)
30:(00.037467) amdgpu_plugin: Value str: (null)
31:(00.037468) amdgpu_plugin: param: KFD_MAX_BUFFER_SIZE:0x0
1862:(00.096418) amdgpu_plugin: amdgpu_plugin_dump_file() called for fd = 237
1863:(00.096532) amdgpu_plugin: devices:1 bos:159 objects:86 priv_data:82504
1864:(00.096759) amdgpu_plugin: Dumped devices Ok (ret:0)
1865:(00.096965) amdgpu_plugin: Thread[0x5759] started
1866:(01.135432) amdgpu_plugin: Thread[0x5759] done num_bos:74 ret:0
1867:(01.139960) amdgpu_plugin: Thread[0x5759] finished ret:0
1868:(01.139999) amdgpu_plugin: Dumped bos ok (ret:0)
1869:(01.140015) amdgpu_plugin: img_path = amdgpu-kfd-234.img
1870:(01.140028) amdgpu_plugin: Len = 87520
1871:(01.140190) amdgpu_plugin: Dump successful
1880:(01.140584) amdgpu_plugin: Process unpaused Ok (ret:0)
7587:(02.867906) amdgpu_plugin: finished  amdgpu_plugin (AMDGPU/KFD)

Restore log:

13:(00.001087) amdgpu_plugin: initialized:  amdgpu_plugin (AMDGPU/KFD)
14:(00.001093) amdgpu_plugin: param: KFD_FW_VER_CHECK:Y
15:(00.001094) amdgpu_plugin: param: KFD_SDMA_FW_VER_CHECK:Y
16:(00.001096) amdgpu_plugin: param: KFD_CACHES_COUNT_CHECK:Y
17:(00.001097) amdgpu_plugin: param: KFD_NUM_GWS_CHECK:Y
18:(00.001098) amdgpu_plugin: param: KFD_VRAM_SIZE_CHECK:Y
19:(00.001099) amdgpu_plugin: param: KFD_NUMA_CHECK:Y
20:(00.001101) amdgpu_plugin: param: KFD_CAPABILITY_CHECK:Y
21:(00.001103) amdgpu_plugin: Value str: (null)
22:(00.001104) amdgpu_plugin: param: KFD_MAX_BUFFER_SIZE:0x0
1121:(02.850329)      1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 234
1178:(02.850565)      1: amdgpu_plugin: Opened kfd, fd = 4
1179:(02.850574)      1: amdgpu_plugin: KFD Image file size:87520
1180:(02.850873)      1: amdgpu_plugin: ===System Topology=[  Source    ]==================================
1181:(02.850875)      1: amdgpu_plugin: [1] GPU gpu_id:0x5759
1182:(02.850877)      1: amdgpu_plugin:      vendor_id:4098 device_id:29877
1183:(02.850878)      1: amdgpu_plugin:      vram_public:Y vram_size:205822885888
1184:(02.850880)      1: amdgpu_plugin:      io_links_count:1 capability:2893521536
1185:(02.850881)      1: amdgpu_plugin:      mem_banks_count:1 caches_count:626 lds_size_in_kb:64
1186:(02.850883)      1: amdgpu_plugin:      simd_count:1216 max_waves_per_simd:8
1187:(02.850884)      1: amdgpu_plugin:      num_gws:64 wave_front_size:64 array_count:32
1188:(02.850885)      1: amdgpu_plugin:      simd_arrays_per_engine:1 simd_per_cu:4
1189:(02.850887)      1: amdgpu_plugin:      max_slots_scratch_cu:32 cu_per_simd_array:10
1190:(02.850888)      1: amdgpu_plugin:      num_sdma_engines:2
1191:(02.850889)      1: amdgpu_plugin:      num_sdma_xgmi_engines:14 num_sdma_queues_per_engine:8
1192:(02.850890)      1: amdgpu_plugin:      num_cp_queues:30 fw_version:32945 sdma_fw_version:24
1193:(02.850892)      1: amdgpu_plugin:      iolink type:PCIe node-to:0 (0x0000) node-from:1 bi-dir:Y
1194:(02.850894)      1: amdgpu_plugin: [0] CPU
1195:(02.850895)      1: amdgpu_plugin:      cpu_cores_count:13
1196:(02.850896)      1: amdgpu_plugin:      iolink type:PCIe node-to:1 (0x5759) node-from:0 bi-dir:Y
1197:(02.850897)      1: amdgpu_plugin: ===Groups==========================================================
1198:(02.850899)      1: amdgpu_plugin: ===================================================================
1199:(02.850900)      1: amdgpu_plugin: ===System Topology=[ Destination]==================================
1200:(02.850911)      1: amdgpu_plugin: [1] GPU gpu_id:0x5759
1201:(02.850913)      1: amdgpu_plugin:      vendor_id:4098 device_id:29877
1202:(02.850914)      1: amdgpu_plugin:      vram_public:Y vram_size:205822885888
1203:(02.850915)      1: amdgpu_plugin:      io_links_count:1 capability:2893521536
1204:(02.850916)      1: amdgpu_plugin:      mem_banks_count:1 caches_count:626 lds_size_in_kb:64
1205:(02.850918)      1: amdgpu_plugin:      simd_count:1216 max_waves_per_simd:8
1206:(02.850919)      1: amdgpu_plugin:      num_gws:64 wave_front_size:64 array_count:32
1207:(02.850920)      1: amdgpu_plugin:      simd_arrays_per_engine:1 simd_per_cu:4
1208:(02.850921)      1: amdgpu_plugin:      max_slots_scratch_cu:32 cu_per_simd_array:10
1209:(02.850923)      1: amdgpu_plugin:      num_sdma_engines:2
1210:(02.850924)      1: amdgpu_plugin:      num_sdma_xgmi_engines:14 num_sdma_queues_per_engine:8
1211:(02.850925)      1: amdgpu_plugin:      num_cp_queues:30 fw_version:32945 sdma_fw_version:24
1212:(02.850926)      1: amdgpu_plugin:      iolink type:PCIe node-to:0 (0x0000) node-from:1 bi-dir:Y
1213:(02.850927)      1: amdgpu_plugin: [0] CPU
1214:(02.850929)      1: amdgpu_plugin:      cpu_cores_count:13
1215:(02.850930)      1: amdgpu_plugin:      iolink type:PCIe node-to:1 (0x5759) node-from:0 bi-dir:Y
1216:(02.850931)      1: amdgpu_plugin: ===Groups==========================================================
1217:(02.850932)      1: amdgpu_plugin: ===================================================================
1218:(02.850937)      1: amdgpu_plugin: ===Maps===============
1219:(02.850939)      1: amdgpu_plugin: GPU: 0x5759 -> 0x5759
1220:(02.850940)      1: amdgpu_plugin: CPU: 00 -> 00
1221:(02.850941)      1: amdgpu_plugin: ======================
1222:(02.850943)      1: amdgpu_plugin: Maps after all nodes matched
1223:(02.850944)      1: amdgpu_plugin: ===Maps===============
1224:(02.850945)      1: amdgpu_plugin: GPU: 0x5759 -> 0x5759
1225:(02.850946)      1: amdgpu_plugin: CPU: 00 -> 00
1226:(02.850948)      1: amdgpu_plugin: ======================
1227:(02.851165)      1: amdgpu_plugin: passing drm render fd = 35 to driver
1228:(02.851167)      1: amdgpu_plugin: Restore devices Ok (ret:0)
1229:(02.851174)      1: amdgpu_plugin: Restore BOs Ok
1230:(02.854044)      1: amdgpu_plugin: Thread[0x5759] started
1231:(03.183827)      1: amdgpu_plugin: Thread[0x5759] done num_bos:74 ret:0
1232:(03.187578)      1: amdgpu_plugin: Thread[0x5759] finished ret:0
1233:(03.187637)      1: amdgpu_plugin: Restore successful (fd:4)
1244:(03.187767)      1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 238
1250:(03.187782)      1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-238.img: Failed to open for read
1252:(03.187787)      1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-238.img
1260:(03.187802)      1: amdgpu_plugin: render node gpu_id = 0x5759
1262:(03.187804)      1: amdgpu_plugin: render node destination gpu_id = 0x5759
1331:(03.187925)      1: amdgpu_plugin: Initialized kfd plugin restorer with ID = 247
1334:(03.187936)      1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-247.img: Failed to open for read
1336:(03.187939)      1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-247.img
1341:(03.187946)      1: amdgpu_plugin: render node gpu_id = 0x5759
1343:(03.187948)      1: amdgpu_plugin: render node destination gpu_id = 0x5759
3944:(03.323798) amdgpu_plugin: Inside amdgpu_plugin_resume_devices_late for target pid = 24118
3945:(03.323847) amdgpu_plugin: Calling IOCTL to start notifiers and queues
3946:(03.437808) amdgpu_plugin: Inside amdgpu_plugin_resume_devices_late for target pid = 24141
3947:(03.437859) amdgpu_plugin: Calling IOCTL to start notifiers and queues
3948:(03.437868) amdgpu_plugin: Pid 24141 has no kfd process info
3953:(03.438717) amdgpu_plugin: finished  amdgpu_plugin (AMDGPU/KFD)
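Although the restore completes, the excerpt above contains two `Error` lines from `amdgpu_plugin_util.c:120` (for `amdgpu-kfd-238.img` and `amdgpu-kfd-247.img`). Whether they are related to the hang is not established here. A small filter like the one below can pull such lines out of the full logs; it is a sketch that assumes the line layout shown in these excerpts (an optional `grep -n` style line number, a timestamp in parentheses, an optional task prefix such as `1:`), and the function name is hypothetical.

```python
import re

# Optional "NNN:" grep line number, "(SS.UUUUUU)" timestamp,
# optional "<pid>:" task prefix, then the message itself.
LINE_RE = re.compile(
    r"^(?:(?P<lineno>\d+):)?\((?P<ts>\d+\.\d+)\)\s+"
    r"(?:(?P<pid>\d+):\s+)?(?P<msg>.*)$"
)


def plugin_errors(log_text):
    """Return (timestamp, message) for amdgpu_plugin Error/Warn lines."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match["msg"]
        if "amdgpu_plugin" in msg and msg.startswith(("Error", "Warn")):
            found.append((float(match["ts"]), msg))
    return found


# Lines copied from the restore log in this issue.
SAMPLE = """\
1233:(03.187637)      1: amdgpu_plugin: Restore successful (fd:4)
1250:(03.187782)      1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-238.img: Failed to open for read
1252:(03.187787)      1: amdgpu_plugin: Restoring RenderD amdgpu-renderD-238.img
1334:(03.187936)      1: Error (amdgpu_plugin_util.c:120): amdgpu_plugin: amdgpu-kfd-247.img: Failed to open for read
3944:(03.323798) amdgpu_plugin: Inside amdgpu_plugin_resume_devices_late for target pid = 24118
"""

for ts, msg in plugin_errors(SAMPLE):
    print(f"{ts:.6f}  {msg}")
```

The same function can be applied to the attached `dump.log` and `restore.log` by reading each file and passing its contents as `log_text`.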

Full CRIU log
dump.log

restore.log

Please let me know if you need any more info.
