ilab's accelerator counting should be using vendor-specific APIs instead of torch.cuda.device_count() for HPUs #3357
Open

Description

@frantisekz

Describe the bug
ilab counts available accelerators with torch.cuda.device_count(), which returns 0 on Gaudi systems because HPUs are not exposed through the CUDA backend. As a result, ilab model serve rejects a valid --gpus value even though eight HPUs are present (see torch.hpu.device_count in the system info below).

To Reproduce
Steps to reproduce the behavior:

  1. Run ilab model serve on a Gaudi system.
  2. Observe the error: Specified --gpus value (8) exceeds available GPUs (0).
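
The mismatch is easy to demonstrate in a plain Python shell on the same machine. A minimal sketch, assuming the habana_frameworks PyTorch plugin is installed (the expected values match the ilab system info below):

import torch
import habana_frameworks.torch  # noqa: F401  -- registers the torch.hpu backend

# What ilab currently checks: HPUs are not visible through the CUDA API.
print(torch.cuda.device_count())  # expected: 0

# The vendor-specific counter reports the accelerators actually present.
print(torch.hpu.device_count())   # expected: 8 on this Gaudi 3 system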

Expected behavior
ilab should use the vendor-specific API, e.g. torch.hpu.device_count() on Gaudi, to determine the accelerator count.
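
For illustration, a minimal sketch of vendor-aware counting. The helper name count_accelerators() is hypothetical, not ilab's actual API, and the fallback assumes the habana_frameworks plugin is installed on Gaudi systems:

import torch


def count_accelerators() -> int:
    """Count accelerators, preferring CUDA/ROCm, falling back to HPU."""
    # NVIDIA CUDA and AMD ROCm builds of PyTorch both report devices here.
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    # Importing the Habana plugin registers the torch.hpu backend; fall
    # back to it when no devices are visible through the CUDA API.
    try:
        import habana_frameworks.torch  # noqa: F401
    except ImportError:
        return 0
    if torch.hpu.is_available():
        return torch.hpu.device_count()
    return 0

Trying torch.cuda first keeps today's behavior on NVIDIA and ROCm systems, where devices are already surfaced through the CUDA API, and only consults the HPU backend when that count comes back empty.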

Device Info:

  • Hardware Specs: Gaudi 3 system (srv15 was used)
  • OS Version: RHEL AI 1.5
  • Python Version: Python 3.11.7
  • InstructLab Version:
(app-root) /$ ilab system info
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 224
CPU RAM       : 1056269996 KB
------------------------------------------------------------------------------
Platform:
  sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.62.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: g3-srv15-c03b-idc
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1007.34 GB
  memory.available: 993.20 GB
  memory.used: 9.12 GB

InstructLab:
  instructlab.version: 0.26.0
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.1

Torch:
  torch.version: 2.6.0+hpu_1.20.1-97.gitec84ea4
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: None
  torch.cuda.available: False
  torch.backends.cuda.is_built: False
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  habana_torch_plugin.version: 1.20.1.97
  torch.hpu.is_available: True
  torch.hpu.device_count: 8
  torch.hpu.0.name: GAUDI3
  torch.hpu.0.capability: 1.20.1.9870b10
  torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.1.name: GAUDI3
  torch.hpu.1.capability: 1.20.1.9870b10
  torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.2.name: GAUDI3
  torch.hpu.2.capability: 1.20.1.9870b10
  torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.3.name: GAUDI3
  torch.hpu.3.capability: 1.20.1.9870b10
  torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.4.name: GAUDI3
  torch.hpu.4.capability: 1.20.1.9870b10
  torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.5.name: GAUDI3
  torch.hpu.5.capability: 1.20.1.9870b10
  torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.6.name: GAUDI3
  torch.hpu.6.capability: 1.20.1.9870b10
  torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.7.name: GAUDI3
  torch.hpu.7.capability: 1.20.1.9870b10
  torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  env.HABANALABS_HLTHUNK_TESTS_BIN_PATH: /opt/habanalabs/src/hl-thunk/tests
  env.HABANA_LOGS: /var/log/habana_logs/
  env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
  env.HABANA_PROFILE: profile_api_light
  env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
  env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False

Additional context
https://issues.redhat.com/browse/RHELAI-4052

From what I've seen, this is not an issue on AMD ROCm hardware, where the torch.cuda call works fine as well, since ROCm builds of PyTorch expose HIP devices through the torch.cuda API.
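
A quick way to see why the two platforms differ, reading only fields already shown in the system info above:

import torch

# ROCm builds set torch.version.hip and surface HIP devices through the
# torch.cuda API, so torch.cuda.device_count() stays accurate there.
# On this Gaudi build both version fields are None, so the CUDA path
# sees no devices at all.
print(torch.version.cuda)         # None on this system
print(torch.version.hip)          # None here; set on ROCm builds
print(torch.cuda.device_count())  # 0 on this system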

    Labels

    bug (Something isn't working)
