`akash-inventory-operator`/`provider-services psutil` fail to detect NVIDIA GPUs with non-10de subsystem vendor ID · Issue #298 · akash-network/support · GitHub
akash-inventory-operator/provider-services psutil fail to detect NVIDIA GPUs with non-10de subsystem vendor ID #298
Open
@andy108369

Description


Summary

The Akash akash-inventory-operator and provider-services tools psutil fail to detect an NVIDIA GPU, and therefore do not apply the nvidia.com/gpu.present=true label, when the GPU has a non-NVIDIA subsystem vendor ID (e.g., 10b0 instead of 10de). This typically happens when the card is manufactured by an AIB (Add-In Board) partner such as CardExpert or Gainward. As a result, the NVIDIA device plugin (NVDP) does not run on the node, even though the GPU is visible in lspci and fully functional.

Example of such board: https://www.techpowerup.com/vgabios/251243/gainward-rtx4090-24576-221012

Details

On node4, the GPU is a working NVIDIA RTX 4090, but lspci reports a non-NVIDIA subsystem vendor:

Subsystem: CardExpert Technology Device [10b0:f297]

Because of this, provider-services tools psutil list gpu does not list the GPU at all, so the inventory operator skips labeling the node.

On a working node (node2) with:

Subsystem: NVIDIA Corporation Device [10de:2684]

the GPU is correctly detected and labeled.
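To make the difference between the two nodes concrete, here is a minimal Python sketch that pulls the vendor:device pair out of the two Subsystem lines quoted above and compares the vendor part with 10de. The helper name parse_pci_id is hypothetical and the code is not taken from provider-services; it only illustrates which field differs.

```python
import re

# Matches a trailing "[vvvv:dddd]" pair as printed in the Subsystem lines above.
PCI_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]\s*$")

NVIDIA_VENDOR_ID = "10de"


def parse_pci_id(line):
    """Return (vendor, device) as lowercase hex strings, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = PCI_ID_RE.search(line.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


if __name__ == "__main__":
    for line in (
        "Subsystem: CardExpert Technology Device [10b0:f297]",  # node4
        "Subsystem: NVIDIA Corporation Device [10de:2684]",     # node2
    ):
        vendor, device = parse_pci_id(line)
        print(vendor, device, vendor == NVIDIA_VENDOR_ID)
```

On node4 the subsystem vendor is 10b0, so any check that requires the subsystem vendor to equal 10de would reject the card even though the GPU itself is an NVIDIA part.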

Suggested Fix

Relax the detection logic in akash-inventory-operator so that GPUs are matched on the primary vendor ID (10de) regardless of the subsystem vendor ID, or fall back to lspci output if psutil fails to detect the card.
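A rough sketch of the first option, written in Python for brevity and not taken from the operator's code: it reads the standard Linux sysfs attributes vendor and class for each PCI device and deliberately never looks at subsystem_vendor. The function names (is_nvidia_gpu, find_nvidia_gpus) and the sysfs_root parameter are assumptions made for illustration; treating every device with PCI base class 0x03 (display controller) as a GPU is also an assumption about what the operator would want to match.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = 0x10DE   # primary PCI vendor ID for NVIDIA
DISPLAY_BASE_CLASS = 0x03   # PCI base class "display controller"


def read_hex(path):
    """Parse a sysfs attribute such as '0x10de\\n' into an int."""
    return int(path.read_text().strip(), 16)


def is_nvidia_gpu(vendor, pci_class):
    """Match on primary vendor and base class only.

    pci_class is the 24-bit sysfs value (base class, subclass, prog-if);
    the subsystem vendor is intentionally not part of the decision.
    """
    return vendor == NVIDIA_VENDOR_ID and (pci_class >> 16) == DISPLAY_BASE_CLASS


def find_nvidia_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses of NVIDIA display devices under sysfs_root."""
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = read_hex(dev / "vendor")
            pci_class = read_hex(dev / "class")
        except (OSError, ValueError):
            continue  # unreadable or malformed entry, skip it
        if is_nvidia_gpu(vendor, pci_class):
            found.append(dev.name)
    return found


if __name__ == "__main__":
    print(find_nvidia_gpus())
```

With this kind of check, a card whose subsystem vendor is 10b0 would be reported the same way as one whose subsystem vendor is 10de, because only the primary vendor ID and the device class decide the match.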

Workaround

Manually label the node to enable NVDP:

kubectl label node <node-name> nvidia.com/gpu.present=true
