`akash-inventory-operator` / `provider-services psutil` fail to detect NVIDIA GPUs with non-10de subsystem vendor ID #298
Comments
A workaround patch has been submitted in PR #53 to extend […]. This workaround is effective for now, but a more robust long-term solution would be to update […].
…b0 (CardExpert Technology) (#53)

Some RTX 4090 GPUs, such as those manufactured by CardExpert or other AIB partners, report a **Subsystem Vendor ID** of `10b0` instead of the expected `10de` (NVIDIA Corporation). Because of this, `akash-inventory-operator` fails to detect the GPU, and the required `nvidia.com/gpu.present=true` label is not applied to the node. This prevents the NVIDIA device plugin (NVDP) from scheduling GPU workloads, even though the GPU is fully functional and visible in `lspci`.

This patch works around the issue by adding a new `10b0` vendor entry in `gpus.json`, mapping Product ID `2684` to the RTX 4090. This enables proper detection and labeling without requiring hardware modifications or changes to `provider-services`.

> Detection is based solely on **Vendor ID** and **Product ID**, and does not evaluate the **Subsystem Vendor ID** or **Subsystem Device ID**, as confirmed in:
> https://github.com/akash-network/provider/blob/v0.6.11-rc1/operator/inventory/node-discovery.go#L942-L950

This workaround is effective for now, but a more robust long-term solution would be to update `akash-inventory-operator` (and `provider-services psutil`) to ignore the **Subsystem Vendor ID** entirely and rely only on the primary PCI Vendor and Device IDs.

Ref: akash-network/support#298
…etection (#54)

The previous entry under vendor `10b0` used Product ID `2684`, which reflects the primary PCI device ID. This patch updates `gpus.json` to match the format currently used by the inventory operator, enabling proper detection and labeling of affected RTX 4090 GPUs.

Ref: akash-network/support#298
Two RTX 4090 GPUs on host report Subsystem Device ID f297 under Vendor ID 10de. Since provider-services uses this ID format for detection, f297 must be listed under 10de in gpus.json. This patch corrects the previous placement under vendor 10b0, ensuring all GPUs are properly detected and labeled. Ref: akash-network/support#298
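
Why the placement matters can be illustrated with a lookup that is keyed first on vendor ID: an entry filed under a different vendor is never reached. The sketch below is only an illustration. The table is a hypothetical, simplified stand-in and not the real `gpus.json` schema, and `lookup_gpu` is a made-up helper name; only the IDs (`10de`, `10b0`, `2684`, `f297`) come from this thread.

```python
# Hypothetical, simplified table: vendor ID -> {ID -> model name}.
# This is NOT the real gpus.json schema, only a stand-in for illustration.
KNOWN_GPUS = {
    "10de": {"2684": "rtx4090", "f297": "rtx4090"},
}


def lookup_gpu(vendor_id, product_id, table=KNOWN_GPUS):
    """Return the model name for a vendor/ID pair, or None if it is not listed."""
    return table.get(vendor_id.lower(), {}).get(product_id.lower())
```

With this table, `lookup_gpu("10de", "f297")` finds the card, while the same ID queried under `10b0` returns `None`.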
It looks like the workaround PR did not help in the case of a host with a mixed set of GPUs. Original PR: […]; also tried with: […]

The host has 6 GPUs, with only two being […]

This can most likely be resolved the same way as it was resolved for #310, by simply loading […]. Next time I'm going to make sure […].

Update: The server had […]. I'll keep this issue open so that next time we can make sure […].
Summary
The Akash `akash-inventory-operator` / `provider-services psutil` fail to detect NVIDIA GPUs and apply the `nvidia.com/gpu.present=true` label when the GPU has a non-NVIDIA subsystem vendor ID (e.g., `10b0` instead of `10de`). This typically happens when the card is manufactured by an AIB (Add-In Board) partner like CardExpert or Gainward. As a result, the NVIDIA device plugin (NVDP) does not run, even though the GPU is visible in `lspci` and fully functional.

Example of such a board: https://www.techpowerup.com/vgabios/251243/gainward-rtx4090-24576-221012
Details
On `node4`, the GPU is a valid NVIDIA 4090 but has: […]

Because of this, `provider-services tools psutil list gpu` does not list the GPU at all, leading the inventory operator to skip labeling the node.

On a working node (`node2`) with: […]

the GPU is correctly detected and labeled.
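
To make the difference between the primary and subsystem IDs concrete, here is a minimal sketch that reads both from the Linux sysfs PCI attributes (`vendor`, `device`, `subsystem_vendor`, `subsystem_device`, `class`) and flags a device as an NVIDIA GPU based on the primary vendor ID alone. It is not how `provider-services` works internally; the function names are made up for illustration, and the display-controller check assumes the standard PCI base class `0x03`.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "10de"  # primary PCI vendor ID for NVIDIA
PCI_ATTRS = ("vendor", "device", "subsystem_vendor", "subsystem_device", "class")


def read_pci_ids(device_dir):
    """Read primary and subsystem IDs from a sysfs-style PCI device directory."""
    ids = {}
    for attr in PCI_ATTRS:
        raw = (Path(device_dir) / attr).read_text().strip().lower()
        # sysfs stores these values as hex strings such as "0x10de"
        ids[attr] = raw[2:] if raw.startswith("0x") else raw
    return ids


def is_nvidia_display_device(ids):
    """True when the primary vendor is NVIDIA and the base class is 0x03 (display)."""
    return ids["vendor"] == NVIDIA_VENDOR_ID and ids["class"].startswith("03")


def scan_nvidia_gpus(root="/sys/bus/pci/devices"):
    """Return (device name, ids) pairs for NVIDIA display devices found under root."""
    found = []
    for dev in sorted(Path(root).glob("*")):
        ids = read_pci_ids(dev)
        if is_nvidia_display_device(ids):
            found.append((dev.name, ids))
    return found
```

A card whose `subsystem_vendor` reads `10b0` still passes this check, because only `vendor` and `class` are consulted; the subsystem fields are returned purely for inspection.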
Suggested Fix
Relax detection logic in `akash-inventory-operator` to match GPUs based on primary vendor ID (`10de`) regardless of the subsystem vendor ID, or fall back to `lspci` output if `psutil` fails to detect the card.
Workaround
Manually label the node to enable NVDP:
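
One way to script this is sketched below, assuming `kubectl` is on the `PATH` and already pointed at the right cluster. The label string is the one named in this issue; the helper names and the `node4` node name in the usage note are illustrative only, and this is not necessarily the exact command originally posted here.

```python
import subprocess

GPU_PRESENT_LABEL = "nvidia.com/gpu.present=true"  # label named in this issue


def label_command(node_name):
    """Build the kubectl argv that applies the GPU-present label to one node."""
    return ["kubectl", "label", "node", node_name, GPU_PRESENT_LABEL, "--overwrite"]


def label_node(node_name):
    """Run kubectl; raises subprocess.CalledProcessError if the command fails."""
    subprocess.run(label_command(node_name), check=True)
```

Calling `label_node("node4")` runs the equivalent of `kubectl label node node4 nvidia.com/gpu.present=true --overwrite`. The `--overwrite` flag makes the call safe to repeat if the label already exists with a different value.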
References