`akash-inventory-operator`/`provider-services psutil` fail to detect NVIDIA GPUs with non-10de subsystem vendor ID · Issue #298 · akash-network/support · GitHub

akash-inventory-operator/provider-services psutil fail to detect NVIDIA GPUs with non-10de subsystem vendor ID #298

Open
andy108369 opened this issue Apr 10, 2025 · 3 comments
Assignees
Labels
repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor
andy108369 commented Apr 10, 2025

Summary

Both akash-inventory-operator and provider-services psutil fail to detect NVIDIA GPUs, and therefore never apply the nvidia.com/gpu.present=true label, when the GPU reports a non-NVIDIA subsystem vendor ID (e.g., 10b0 instead of 10de). This typically happens when the card is manufactured by an AIB (add-in board) partner such as CardExpert or Gainward. As a result, the NVIDIA device plugin (NVDP) does not run, even though the GPU is visible in lspci and fully functional.

Example of such board: https://www.techpowerup.com/vgabios/251243/gainward-rtx4090-24576-221012

Details

On node4, the GPU is a valid NVIDIA 4090 but has:

Subsystem: CardExpert Technology Device [10b0:f297]

Because of this, provider-services tools psutil list gpu does not list the GPU at all, leading the inventory operator to skip labeling the node.

On a working node (node2) with:

Subsystem: NVIDIA Corporation Device [10de:2684]

the GPU is correctly detected and labeled.
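
For reference, the primary and subsystem ID pairs that lspci prints are exposed as separate sysfs attributes, so they can be compared without parsing lspci output. A minimal sketch, assuming the standard Linux sysfs layout under /sys/bus/pci/devices; the function name `list_pci_ids` and its optional root argument are hypothetical and exist only for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of provider-services): print
# "<slot> <vendor>:<device> <subsystem_vendor>:<subsystem_device>"
# for every display-class PCI device found under a sysfs-style directory.
# The optional argument lets the function run against a fake tree.
list_pci_ids() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        [ -r "$dev/class" ] || continue
        class=$(cat "$dev/class")
        # PCI base class 0x03 covers display controllers (VGA, 3D, ...).
        case "$class" in
            0x03*) ;;
            *) continue ;;
        esac
        v=$(cat "$dev/vendor")
        d=$(cat "$dev/device")
        sv=$(cat "$dev/subsystem_vendor")
        sd=$(cat "$dev/subsystem_device")
        # Strip the leading "0x" so the output matches lspci's notation.
        printf '%s %s:%s %s:%s\n' "${dev##*/}" "${v#0x}" "${d#0x}" "${sv#0x}" "${sd#0x}"
    done
}
```

Running `list_pci_ids` with no argument on an affected node should show the same primary pair for every card and a differing subsystem pair for the AIB cards.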

Suggested Fix

Relax the detection logic in akash-inventory-operator so that GPUs are matched on the primary vendor ID (10de) regardless of the subsystem vendor ID, or fall back to lspci output if psutil fails to detect the card.
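
To illustrate what matching on the primary vendor ID alone could look like, here is a rough sketch that filters the numeric output of `lspci -n`. It is not the operator's Go code; the function name `nvidia_gpu_slots` is hypothetical, and the choice to accept any display-class device (class code starting with 03) is an assumption made for this example:

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -n` lines on stdin, which look like
#   "<slot> <class>: <vendor>:<device> (rev xx)"
# and print the slot of each display-class device whose primary vendor is 10de.
# The subsystem IDs are not on these lines, so they cannot influence the match.
nvidia_gpu_slots() {
    awk '$2 ~ /^03/ && $3 ~ /^10de:/ { print $1 }'
}
```

Usage would be `lspci -n | nvidia_gpu_slots`. The related `lspci -d 10de:` option also selects by primary vendor ID, but it includes non-display functions such as the GPU's audio device.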

Workaround

Manually label the node to enable NVDP:

kubectl label node <node-name> nvidia.com/gpu.present=true


@andy108369
Contributor Author
andy108369 commented Apr 10, 2025

A workaround patch has been submitted in PR #53 to extend gpus.json and recognize the RTX 4090 under vendor ID 10b0 (used by CardExpert Technology and other AIBs). This enables GPU detection and labeling by the akash-inventory-operator without requiring changes to provider-services or any hardware-level modifications.

This workaround is effective for now, but a more robust long-term solution would be to update akash-inventory-operator (and provider-services psutil) to ignore the Subsystem Vendor ID entirely and rely only on the primary PCI Vendor and Device IDs.

@andy108369 andy108369 changed the title akash-inventory-operator fails to detect NVIDIA GPUs with non-10de subsystem vendor ID akash-inventory-operator/provider-services fail to detect NVIDIA GPUs with non-10de subsystem vendor ID Apr 10, 2025
@andy108369 andy108369 changed the title akash-inventory-operator/provider-services fail to detect NVIDIA GPUs with non-10de subsystem vendor ID akash-inventory-operator/provider-services psutil fail to detect NVIDIA GPUs with non-10de subsystem vendor ID Apr 10, 2025
andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…rt Technology)

Some RTX 4090 GPUs—such as those from CardExpert or other AIB partners—report a **Subsystem Vendor ID** of `10b0` instead of the expected `10de` (NVIDIA). As a result, these GPUs are not detected by `akash-inventory-operator`, and the `nvidia.com/gpu.present=true` label is never applied, preventing the NVIDIA k8s-device-plugin (NVDP) from scheduling on the node.

This patch adds a new `10b0` vendor entry in `gpus.json` with the `2684` Product ID for RTX 4090, enabling proper detection and labeling.

Detection is based solely on **Vendor ID** and **Product ID** and does not inspect **Subsystem Vendor ID** or **Subsystem Device ID**, as confirmed in:
https://github.com/akash-network/provider/blob/v0.6.11-rc1/operator/inventory/node-discovery.go#L942-L950

Ref: akash-network/support#298
andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…rt Technology)

Some RTX 4090 GPUs—such as those manufactured by CardExpert or other AIB partners—report a **Subsystem Vendor ID** of `10b0` instead of the expected `10de` (NVIDIA Corporation). Because of this, `akash-inventory-operator` fails to detect the GPU, and the required `nvidia.com/gpu.present=true` label is not applied to the node. This prevents the NVIDIA device plugin (NVDP) from scheduling GPU workloads, even though the GPU is fully functional and visible in `lspci`.

This patch works around the issue by adding a new `10b0` vendor entry in `gpus.json`, mapping Product ID `2684` to the RTX 4090. This enables proper detection and labeling without requiring hardware modifications or changes to `provider-services`.

> Detection is based solely on **Vendor ID** and **Product ID**, and does not evaluate the **Subsystem Vendor ID** or **Subsystem Device ID**, as confirmed in:
> https://github.com/akash-network/provider/blob/v0.6.11-rc1/operator/inventory/node-discovery.go#L942-L950

This workaround is effective for now, but a more robust long-term solution would be to update `akash-inventory-operator` to ignore the **Subsystem Vendor ID** entirely and rely only on the primary PCI Vendor and Device IDs.

Ref: akash-network/support#298
andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…rt Technology)

Some RTX 4090 GPUs—such as those manufactured by CardExpert or other AIB partners—report a **Subsystem Vendor ID** of `10b0` instead of the expected `10de` (NVIDIA Corporation). Because of this, `akash-inventory-operator` fails to detect the GPU, and the required `nvidia.com/gpu.present=true` label is not applied to the node. This prevents the NVIDIA device plugin (NVDP) from scheduling GPU workloads, even though the GPU is fully functional and visible in `lspci`.

This patch works around the issue by adding a new `10b0` vendor entry in `gpus.json`, mapping Product ID `2684` to the RTX 4090. This enables proper detection and labeling without requiring hardware modifications or changes to `provider-services`.

> Detection is based solely on **Vendor ID** and **Product ID**, and does not evaluate the **Subsystem Vendor ID** or **Subsystem Device ID**, as confirmed in:
> https://github.com/akash-network/provider/blob/v0.6.11-rc1/operator/inventory/node-discovery.go#L942-L950

This workaround is effective for now, but a more robust long-term solution would be to update `akash-inventory-operator` (and `provider-services psutil`) to ignore the **Subsystem Vendor ID** entirely and rely only on the primary PCI Vendor and Device IDs.

Ref: akash-network/support#298
andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…b0 (CardExpert Technology)

Zblocker64 pushed a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…b0 (CardExpert Technology) (#53)

andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…etection

The previous entry under vendor `10b0` used Product ID `2684`, which reflects the primary PCI device ID. This patch updates `gpus.json` to match the format currently used by the inventory operator, enabling proper detection and labeling of affected RTX 4090 GPUs.

Ref: akash-network/support#298
andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
…etection (#54)

andy108369 added a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
Two RTX 4090 GPUs on host report Subsystem Device ID f297 under Vendor ID 10de.
Since provider-services uses this ID format for detection, f297 must be
listed under 10de in gpus.json. This patch corrects the previous placement under
vendor 10b0, ensuring all GPUs are properly detected and labeled.

Ref: akash-network/support#298
Zblocker64 pushed a commit to akash-network/provider-configs that referenced this issue Apr 10, 2025
@andy108369
Contributor Author
andy108369 commented Apr 10, 2025

It looks like the workaround PR did not help on a host with a mixed set of GPUs:

original PR:

also tried with:

I made sure the cache was clean by inspecting the output of curl -s https://provider-configs.akash.network/devices/gpus | jq -r . before restarting operator-inventory.

The host has six GPUs: two report subsystem 10b0:f297 and the remaining four report subsystem 10de:2684:

root@node4:~# lspci -s 01:00.0 -nn -v | head -2
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: CardExpert Technology Device [10b0:f297]
root@node4:~# lspci -s 42:00.0 -nn -v | head -2
42:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device [10de:2684]
root@node4:~# lspci -s 81:00.0 -nn -v | head -2
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device [10de:2684]
root@node4:~# lspci -s 82:00.0 -nn -v | head -2
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device [10de:2684]
root@node4:~# lspci -s c1:00.0 -nn -v | head -2
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device [10de:2684]
root@node4:~# lspci -s c2:00.0 -nn -v | head -2
c2:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: CardExpert Technology Device [10b0:f297]
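
The per-slot checks above can be condensed into one line per device. A small sketch, assuming input in the `lspci -nn -v` format shown above; the function name `subsystem_summary` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn -v` text on stdin and print
# "<slot> <primary vendor:device> <subsystem vendor:device>" per device.
subsystem_summary() {
    awk '
        # Return the first [vvvv:dddd] pair on the line, without brackets.
        function ids(line) {
            if (match(line, /\[[0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f][0-9a-f][0-9a-f]\]/))
                return substr(line, RSTART + 1, 9)
            return "?"
        }
        # A device line starts with the slot address, e.g. "01:00.0".
        /^[0-9a-f]+:[0-9a-f]/ { slot = $1; primary = ids($0); next }
        # The indented Subsystem line carries the subsystem pair.
        /Subsystem:/ { print slot, primary, ids($0) }
    '
}
```

Fed the six lspci excerpts above, this prints one line per slot, which makes the two 10b0:f297 cards (01:00.0 and c2:00.0) easy to spot next to the four 10de:2684 ones.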

@andy108369
Contributor Author
andy108369 commented Apr 17, 2025

This can most likely be resolved the same way as #310, by simply loading the nvidia_drm kernel module.
I saved a screenshot from the last time, when nvidia_drm wasn't loaded:

Image

Next time I will make sure the nvidia_drm kernel module is actually loaded (modprobe nvidia_drm) to see whether detection works.
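
A quick way to confirm the module state before testing detection, sketched under the assumption that /proc/modules is available on the node; the function name `module_loaded` and its optional file argument are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named kernel module is listed in a
# /proc/modules-style file. Each line there starts with the module name
# followed by a space, so the anchored pattern avoids prefix matches
# (e.g. "nvidia" does not match the "nvidia_drm" line).
module_loaded() {
    mod="$1"
    file="${2:-/proc/modules}"
    grep -q "^${mod} " "$file"
}
```

For example, `module_loaded nvidia_drm || modprobe nvidia_drm` would load the module only when it is missing.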


Update:

The 10b0:f297 GPUs in the server were swapped back for 10de:2684 ones, and they are detected normally again:

Image

I'll keep this issue open so that the next time we get 10b0:f297 GPUs we can make sure nvidia_drm is loaded and test the detection.

Projects
Status: Backlog (not prioritized)
Development

No branches or pull requests

3 participants