Description
Summary
The Akash akash-inventory-operator/provider-services psutil fail to detect NVIDIA GPUs, and therefore do not apply the nvidia.com/gpu.present=true label, when the GPU has a non-NVIDIA subsystem vendor ID (e.g., 10b0 instead of 10de). This typically happens when the card is manufactured by an AIB (Add-In Board) partner such as CardExpert or Gainward. As a result, the NVIDIA device plugin (NVDP) does not run, even though the GPU is visible in lspci and fully functional.
Example of such board: https://www.techpowerup.com/vgabios/251243/gainward-rtx4090-24576-221012
Details
On node4, the GPU is a valid NVIDIA 4090 but has:
Subsystem: CardExpert Technology Device [10b0:f297]
Because of this, provider-services tools psutil list gpu does not list the GPU at all, leading the inventory operator to skip labeling the node.
On a working node (node2) with:
Subsystem: NVIDIA Corporation Device [10de:2684]
the GPU is correctly detected and labeled.
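To confirm the mismatch on an affected node without relying on psutil, the primary and subsystem vendor IDs can be read directly from sysfs. The following is a minimal diagnostic sketch, not part of the operator: the function name nvidia_devices and the sysfs_root parameter are hypothetical, and it assumes a Linux host that exposes the standard vendor and subsystem_vendor files under /sys/bus/pci/devices.

```python
from pathlib import Path

# Primary PCI vendor ID for NVIDIA, in the format sysfs uses.
NVIDIA_VENDOR_ID = "0x10de"


def _read_id(path):
    """Read a sysfs ID file such as 'vendor' and normalise it (e.g. '0x10de')."""
    return path.read_text().strip().lower()


def nvidia_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (address, vendor, subsystem_vendor) for every PCI function
    whose primary vendor is NVIDIA, whatever its subsystem vendor is.

    Note: this also returns non-GPU functions of the card (for example
    its audio controller), since they share the same primary vendor ID.
    """
    found = []
    for dev in sorted(Path(sysfs_root).iterdir()):
        vendor_file = dev / "vendor"
        subsys_file = dev / "subsystem_vendor"
        if not vendor_file.is_file() or not subsys_file.is_file():
            continue
        vendor = _read_id(vendor_file)
        if vendor == NVIDIA_VENDOR_ID:
            found.append((dev.name, vendor, _read_id(subsys_file)))
    return found
```

Calling nvidia_devices() on node4 would be expected to return an entry whose third element is 0x10b0, while on node2 it would be 0x10de.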
Suggested Fix
Relax the detection logic in akash-inventory-operator to match GPUs based on the primary vendor ID (10de) regardless of the subsystem vendor ID, or fall back to lspci output if psutil fails to detect the card.
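The lspci fallback could look roughly like the sketch below. It is an illustration only and does not reflect the operator's actual code: the function name nvidia_gpus_from_lspci is hypothetical, and it assumes the text passed in is the output of lspci -nn, where each line carries the class code in brackets followed by a [vendor:device] pair. Because that output shows only the primary IDs, the subsystem vendor never enters the match.

```python
import re

# Primary PCI vendor ID for NVIDIA.
NVIDIA_VENDOR_ID = "10de"

# Class codes 03xx are display controllers (VGA, 3D, and so on).
DISPLAY_CLASS = re.compile(r"\[03[0-9a-f]{2}\]:", re.IGNORECASE)

# The [vendor:device] pair printed by `lspci -nn`.
ID_PAIR = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)


def nvidia_gpus_from_lspci(lspci_nn_output):
    """Return (pci_address, device_id) for each display controller whose
    primary vendor ID is NVIDIA, ignoring the subsystem vendor entirely."""
    gpus = []
    for line in lspci_nn_output.splitlines():
        if not DISPLAY_CLASS.search(line):
            continue  # not a display controller (e.g. the card's audio function)
        match = ID_PAIR.search(line)
        if match is None:
            continue
        vendor, device = match.group(1).lower(), match.group(2).lower()
        if vendor == NVIDIA_VENDOR_ID:
            gpus.append((line.split()[0], device))
    return gpus
```

The input can be obtained with subprocess.run(["lspci", "-nn"], capture_output=True, text=True).stdout. Filtering on the display class keeps the card's audio function, which shares the 10de vendor ID, out of the result.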
Workaround
Manually label the node to enable NVDP:
kubectl label node <node-name> nvidia.com/gpu.present=true