Describe the bug
On a Gaudi 3 system, `ilab model serve` aborts with `Specified --gpus value (8) exceeds available GPUs (0).` The `--gpus` validation apparently counts devices through CUDA, which reports zero accelerators even though eight HPUs are present and visible via `torch.hpu`.
To Reproduce
Steps to reproduce the behavior:
- Run `ilab model serve` on a Gaudi system.
- See the error `Specified --gpus value (8) exceeds available GPUs (0).` (a minimal Python reproduction follows this list).
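The failing count can presumably be reproduced directly in Python on the same box (assumption: the `--gpus` validation relies on `torch.cuda.device_count()`, which the error message suggests):

```python
import torch

# This build has no CUDA support (torch.backends.cuda.is_built: False in the
# system info below), so a CUDA-based count sees none of the eight HPUs:
print(torch.cuda.device_count())  # -> 0
```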
Expected behavior
ilab should leverage the vendor-specific API, e.g. `torch.hpu.device_count()` on Gaudi, to verify the HPU count instead of relying on CUDA device enumeration alone; a sketch of such a fallback is below.
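A minimal sketch of what that fallback could look like. This is a hypothetical helper, not InstructLab's actual code; it assumes `habana_frameworks.torch` registers the `torch.hpu` backend on import, as the Habana PyTorch bridge does:

```python
import torch


def available_accelerator_count() -> int:
    """Count accelerators, trying vendor backends beyond CUDA.

    Hypothetical sketch: InstructLab's real --gpus validation lives
    elsewhere and may differ.
    """
    # CUDA, and ROCm too: ROCm builds of PyTorch expose HIP devices
    # through the torch.cuda API.
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    # Intel Gaudi: importing habana_frameworks.torch registers the
    # torch.hpu backend; on the system below it reports 8 devices.
    try:
        import habana_frameworks.torch  # noqa: F401
    except ImportError:
        return 0
    if torch.hpu.is_available():
        return torch.hpu.device_count()
    return 0
```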
Device Info:
- Hardware Specs: Gaudi 3 system (srv15 was used)
- OS Version: RHEL AI 1.5
- Python Version: Python 3.11.7
- InstructLab Version: 0.26.0 (full `ilab system info` output below)
(app-root) /$ ilab system info
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
PT_HPU_ENABLE_LAZY_COLLECTIVES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 224
CPU RAM : 1056269996 KB
------------------------------------------------------------------------------
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.62.1.el9_4.x86_64
platform.machine: x86_64
platform.node: g3-srv15-c03b-idc
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1007.34 GB
memory.available: 993.20 GB
memory.used: 9.12 GB
InstructLab:
instructlab.version: 0.26.0
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.8.2
instructlab-training.version: 0.10.1
Torch:
torch.version: 2.6.0+hpu_1.20.1-97.gitec84ea4
torch.backends.cpu.capability: AVX512
torch.version.cuda: None
torch.version.hip: None
torch.cuda.available: False
torch.backends.cuda.is_built: False
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
habana_torch_plugin.version: 1.20.1.97
torch.hpu.is_available: True
torch.hpu.device_count: 8
torch.hpu.0.name: GAUDI3
torch.hpu.0.capability: 1.20.1.9870b10
torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.1.name: GAUDI3
torch.hpu.1.capability: 1.20.1.9870b10
torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.2.name: GAUDI3
torch.hpu.2.capability: 1.20.1.9870b10
torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.3.name: GAUDI3
torch.hpu.3.capability: 1.20.1.9870b10
torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.4.name: GAUDI3
torch.hpu.4.capability: 1.20.1.9870b10
torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.5.name: GAUDI3
torch.hpu.5.capability: 1.20.1.9870b10
torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.6.name: GAUDI3
torch.hpu.6.capability: 1.20.1.9870b10
torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
torch.hpu.7.name: GAUDI3
torch.hpu.7.capability: 1.20.1.9870b10
torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
env.HABANALABS_HLTHUNK_TESTS_BIN_PATH: /opt/habanalabs/src/hl-thunk/tests
env.HABANA_LOGS: /var/log/habana_logs/
env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
env.HABANA_PROFILE: profile_api_light
env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true
llama_cpp_python:
llama_cpp_python.version: 0.3.6
llama_cpp_python.supports_gpu_offload: False
Additional context
https://issues.redhat.com/browse/RHELAI-4052
From what I've seen, this is not an issue on AMD ROCm hardware, where the same CUDA call works fine as well, since ROCm builds of PyTorch expose HIP devices through the `torch.cuda` API.
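A quick way to confirm that on a ROCm box (assuming a ROCm build of PyTorch is installed):

```python
import torch

# ROCm builds of PyTorch report a HIP version and enumerate AMD GPUs
# through the regular torch.cuda API, so the existing check passes there:
print(torch.version.hip)          # set on ROCm builds; None on this Gaudi box
print(torch.cuda.device_count())  # number of AMD GPUs visible
```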