NCCL allreduce is slower than others in certain process groups #820
Comments
It could be. Could you set ...?
@sjeaugey Thanks for your reply. Would you please tell me how to set it?
Actually, it seems there is no difference between GPUs 0/1 and the others. Each pair of GPUs (0-1, 2-3, 4-5 and 6-7) has 2 local NICs. I see nothing unusual here and it should work fine. What could happen here is that the two slow flows (node 0 GPU 0 <-> node 1 GPU 0 and node 0 GPU 1 <-> node 1 GPU 1) end up using the same network link at some point in the fabric. Are the two nodes connected to the same switch, or to different switches through a multi-level network fabric? Do you have a rail-optimized fabric (a different switch per NIC)? If not, are you using adaptive routing? Without a rail-optimized fabric and without adaptive routing, this kind of route collision is quite common.
@sjeaugey We have a rail-optimized fabric and we use a similar network solution to Selene. We suspect that using both local NICs per GPU might cause some collisions locally. Is there any way to limit each GPU to using only one local NIC?
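For reference, a minimal sketch of how one might approximate a per-GPU NIC limit, assuming one process per GPU, an OpenMPI launcher, and the mlx5_* device names seen in the logs (the wrapper script name and rank-to-NIC mapping are assumptions; NCCL_IB_HCA filters NICs for the whole process, so this only works with one rank per GPU):

#!/bin/bash
# Hypothetical wrapper: pin each local rank to a single IB device.
# Launch as: mpirun -np 16 -N 8 ./nic_per_rank.sh ./my_app
NICS=(mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7)
export NCCL_IB_HCA=${NICS[$OMPI_COMM_WORLD_LOCAL_RANK]}:1
exec "$@"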
Ok, thanks. The fact that GPU 0 uses both NICs is normal. If you dump the rings topology with ..., you will see that. So it's normal for GPU 0 to communicate with NET 0 and NET 1; that's how we guarantee traffic stays local to each rail on a communicator with more than one GPU per node. The logs you pasted here are for the communicator with all GPUs (16 channels, using all 8 NICs). Could you get the log for the communicator with only one GPU per node, and check that this communicator is indeed using a different NIC on each GPU?
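As an illustrative sketch (not necessarily the exact variable elided above), one common way to capture per-rank ring/graph information is NCCL's debug variables; the subsystem list and file pattern below are assumptions:

# Sketch: write detailed NCCL logs, including ring/graph setup, per rank.
# %h expands to the hostname and %p to the PID, so each rank gets its own file.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log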
@sjeaugey Thanks for your advice.
Ok, interesting. To better understand what's going on, it would be helpful if you could run with 8 GPUs per node and set ... . The NCCL code is supposed to already detect PEX PCI switches and flatten the PCI topology, so it should not be a problem in theory. It could be a NIC ordering issue, or something else ... which the log would hopefully tell me.
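One possible way to reproduce such a run and collect the log, sketched here with the nccl-tests all_reduce_perf benchmark; the host names, binary path and message sizes are placeholders:

# Sketch: 2 nodes x 8 GPUs, one rank per GPU, with NCCL debug output enabled.
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1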
Sure. I just need to get 2 nodes available first.
It seems this is using NCCL 2.10 which is not detecting the topology correctly:
Previous logs were using NCCL 2.12; it would be good to use a recent enough version. Regardless of that, indeed mlx5_1 and mlx5_0 are not enumerated in PCI order. If that's the way it is enumerated on all nodes, it should not be a problem, but if it is not consistent, it could indeed explain the lower performance. Can you confirm whether the nodes have a different NIC enumeration order or not? If not, can you check whether reordering the NICs in the XML with mlx5_0 first (and re-injecting the XML with ...) solves the problem?
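A hedged sketch of the dump/edit/re-inject workflow being discussed; NCCL_TOPO_DUMP_FILE and NCCL_TOPO_FILE are NCCL's standard variables for this, but the file paths and the manual edit step are assumptions:

# 1) Dump the topology NCCL detects (written as an XML file).
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml
# ... run the job once to produce the dump ...

# 2) Edit the XML by hand so mlx5_0 appears before mlx5_1,
#    then feed the corrected file back to NCCL on the next run.
export NCCL_TOPO_FILE=/tmp/nccl_topo_fixed.xml
unset NCCL_TOPO_DUMP_FILE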
We have tried NCCL 2.15 in the job environment. It gives the same result as 2.10. The issue happens when the NICs are added to the XML, and there seems to be no relevant difference between the versions.
Yes, we fixed it by reordering the NICs and it works. It has proved to be OK at a larger scale of 16 nodes. We will launch on hundreds of nodes later.
The NICs are all enumerated in PCI order in the system, but NCCL doesn't seem to sort them that way; the reason is the irregular topology introduced by the PEX88096 PCIe switch. In this case it causes NIC 0 to go unused during the multi-allreduce test, so the achieved communication bandwidth is halved in scenarios where we do model parallelism intra-node and pipeline or data parallelism inter-node.
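To confirm which NICs actually carry traffic during such a test, one option is to watch the standard InfiniBand port counters in sysfs while the allreduce runs; the device glob below is an assumption based on the mlx5_* names in this thread:

# Sketch: sample transmitted data per NIC (port_xmit_data counts 4-byte words).
for dev in /sys/class/infiniband/mlx5_*; do
    echo "$(basename "$dev"): $(cat "$dev"/ports/1/counters/port_xmit_data)"
done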
I'm failing to see why the NICs get added in reverse order. Can you get a log with ...?
Actually, never mind, I figured out why that happens. It's due to the fact that mlx5_1 is on the same switch as GPU 0, and since we add GPUs first and NICs after that, once the PEX switch is flattened, mlx5_1 ends up before mlx5_0. Now I need to figure out why it is a problem and whether we should reorder them.
After investigating, I still don't see why it would be a problem. When we create a communicator with 1 GPU per node, GPU 0 should get NIC 1 and GPU 1 should get NIC 0. Would you be able to provide the log for GPU 1? The only log was for GPU 0, and it was using NIC 1 as expected.
I don't have stable access to available machines. I'll get clearer logs next time if you need them.
Can you run again with a recent version of NCCL? The NIC selection code for rings was changed in NCCL 2.11 so that two GPUs on the same PCI switch with 2 NICs do not use the same NIC. Your original bug description mentioned NCCL 2.10, 2.12 and 2.15. Please use NCCL 2.15 if you can, or 2.12 in the worst case, but do not use NCCL 2.10, because NCCL 2.10 is expected to show the performance issue you are reporting.
Ok, problem understood. When we run only with GPU 1, mlx5_1 is no longer on the same switch as a GPU which was added before, and the reordering does not occur. Now we need to find out how to end up with a consistent topology graph, regardless of which GPUs are part of the communicator.
Can you check whether the issue is fixed with the attached patch (it applies on top of 2.18 and would probably also work with previous versions)? Thanks!
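For anyone following along, a hedged sketch of applying such a patch and rebuilding NCCL from source; the checkout tag and the patch file path are placeholders, not the actual attachment:

# Sketch: apply a patch on top of an NCCL 2.18 checkout and rebuild.
git clone https://github.com/NVIDIA/nccl.git && cd nccl
git checkout v2.18.3-1              # hypothetical 2.18 tag
git apply /path/to/attached.patch   # placeholder path for the attached patch
make -j src.build                   # builds the library into ./build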
OK. I'll verify it when test resources are ready.
nccl_test.txt
Fix NVLS search (issue NVIDIA#931). Increase max IB NICs to 32. Fix inconsistent device ordering (issue NVIDIA#820). Try to use different devices for different GPUs in systems with more than one NIC per GPU.
I have 4 nodes and only use the first 2 nodes to create 8 process groups to perform all-reduce tests. Each node has 8 NVIDIA A100-SXM4-80GB GPUs and 8 200Gb/s IB NICs. The i-th process group contains all of the i-th GPUs in the first 2 nodes. All of these process groups perform all-reduce operations simultaneously with the same data size. But it is strange that the time cost of each process group is not the same. The results are as follows, and show that the all-reduce operations on GPU 0 and GPU 1 are slower than the others.
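A hedged shell sketch of one way to approximate this kind of test with nccl-tests (the original runs used framework process groups, so this is only an analogue); the host names, message size and output paths are placeholders:

# Sketch: 8 concurrent 2-rank allreduces, group i using GPU i on both nodes.
for i in 0 1 2 3 4 5 6 7; do
    mpirun -np 2 -H node0:1,node1:1 \
        -x CUDA_VISIBLE_DEVICES=$i \
        ./build/all_reduce_perf -b 1G -e 1G -g 1 > allreduce_gpu$i.log 2>&1 &
done
wait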
I tried to use NCCL v2.12.10-1, v2.12.12-1 and v2.15.5-1, with export NCCL_CROSS_NIC=0 or unset NCCL_CROSS_NIC, but none of them solved my problem. Here is the information about my environment.
lspci -tv: the PCIe topology of the GPU-NIC pair for GPU 0 and GPU 1 is different from that of the other GPUs.
export NCCL_DEBUG=INFO logs:
GPU 0:
GPU 7:
We found in the NCCL logs that GPU 0 is connected to two IB NICs. Could this be the reason why GPU 0 and GPU 1 are slower than the others?
Please give us some suggestions on how to solve this problem. Thank you.