ilab's accelerator counting should be using vendor-specific APIs instead of torch.cuda.device_count() for HPUs #3357
Open

Description

@frantisekz

Describe the bug
ilab counts available accelerators with torch.cuda.device_count(), which returns 0 on Gaudi systems because HPUs are not exposed through the CUDA backend. As a result, ilab model serve rejects a valid --gpus value even though eight HPUs are present (see torch.hpu.device_count in the system info below).

To Reproduce
Steps to reproduce the behavior:

  1. Run ilab model serve on a Gaudi system.
  2. Observe the error: Specified --gpus value (8) exceeds available GPUs (0).
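
The mismatch is easy to demonstrate in a plain Python shell on the same machine. A minimal sketch, assuming the habana_frameworks PyTorch plugin is installed (the expected values match the ilab system info below):

import torch
import habana_frameworks.torch  # noqa: F401  -- registers the torch.hpu backend

# What ilab currently checks: HPUs are not visible through the CUDA API.
print(torch.cuda.device_count())  # expected: 0

# The vendor-specific counter reports the accelerators actually present.
print(torch.hpu.device_count())   # expected: 8 on this Gaudi 3 system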

Expected behavior
ilab should use the vendor-specific API, e.g. torch.hpu.device_count() on Gaudi, to determine the accelerator count.
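
For illustration, a minimal sketch of vendor-aware counting. The helper name count_accelerators() is hypothetical, not ilab's actual API, and the fallback assumes the habana_frameworks plugin is installed on Gaudi systems:

import torch


def count_accelerators() -> int:
    """Count accelerators, preferring CUDA/ROCm, falling back to HPU."""
    # NVIDIA CUDA and AMD ROCm builds of PyTorch both report devices here.
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    # Importing the Habana plugin registers the torch.hpu backend; fall
    # back to it when no devices are visible through the CUDA API.
    try:
        import habana_frameworks.torch  # noqa: F401
    except ImportError:
        return 0
    if torch.hpu.is_available():
        return torch.hpu.device_count()
    return 0

Trying torch.cuda first keeps today's behavior on NVIDIA and ROCm systems, where devices are already surfaced through the CUDA API, and only consults the HPU backend when that count comes back empty.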

Device Info:

  • Hardware Specs: Gaudi 3 system (srv15 was used)
  • OS Version: RHEL AI 1.5
  • Python Version: Python 3.11.7
  • InstructLab Version:
(app-root) /$ ilab system info
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 224
CPU RAM       : 1056269996 KB
------------------------------------------------------------------------------
Platform:
  sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.62.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: g3-srv15-c03b-idc
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1007.34 GB
  memory.available: 993.20 GB
  memory.used: 9.12 GB

InstructLab:
  instructlab.version: 0.26.0
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.1

Torch:
  torch.version: 2.6.0+hpu_1.20.1-97.gitec84ea4
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: None
  torch.cuda.available: False
  torch.backends.cuda.is_built: False
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  habana_torch_plugin.version: 1.20.1.97
  torch.hpu.is_available: True
  torch.hpu.device_count: 8
  torch.hpu.0.name: GAUDI3
  torch.hpu.0.capability: 1.20.1.9870b10
  torch.hpu.0.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.1.name: GAUDI3
  torch.hpu.1.capability: 1.20.1.9870b10
  torch.hpu.1.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.2.name: GAUDI3
  torch.hpu.2.capability: 1.20.1.9870b10
  torch.hpu.2.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.3.name: GAUDI3
  torch.hpu.3.capability: 1.20.1.9870b10
  torch.hpu.3.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.4.name: GAUDI3
  torch.hpu.4.capability: 1.20.1.9870b10
  torch.hpu.4.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.5.name: GAUDI3
  torch.hpu.5.capability: 1.20.1.9870b10
  torch.hpu.5.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.6.name: GAUDI3
  torch.hpu.6.capability: 1.20.1.9870b10
  torch.hpu.6.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  torch.hpu.7.name: GAUDI3
  torch.hpu.7.capability: 1.20.1.9870b10
  torch.hpu.7.properties: sramBaseAddress=144396662951903232, dramBaseAddress=144396800491520000, sramSize=0, dramSize=136465870848, tpcEnabledMask=18446744073709551615, dramEnabled=1, fd=22, device_id=0, device_type=5
  env.HABANALABS_HLTHUNK_TESTS_BIN_PATH: /opt/habanalabs/src/hl-thunk/tests
  env.HABANA_LOGS: /var/log/habana_logs/
  env.HABANA_PLUGINS_LIB_PATH: /opt/habanalabs/habana_plugins
  env.HABANA_PROFILE: profile_api_light
  env.HABANA_SCAL_BIN_PATH: /opt/habanalabs/engines_fw
  env.PT_HPU_ENABLE_LAZY_COLLECTIVES: true

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False

Additional context
https://issues.redhat.com/browse/RHELAI-4052

From what I've seen, this is not an issue on AMD ROCm hardware, where the torch.cuda call works fine as well, since ROCm builds of PyTorch expose HIP devices through the torch.cuda API.
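
A quick way to see why the two platforms differ, reading only fields already shown in the system info above:

import torch

# ROCm builds set torch.version.hip and surface HIP devices through the
# torch.cuda API, so torch.cuda.device_count() stays accurate there.
# On this Gaudi build both version fields are None, so the CUDA path
# sees no devices at all.
print(torch.version.cuda)         # None on this system
print(torch.version.hip)          # None here; set on ROCm builds
print(torch.cuda.device_count())  # 0 on this system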

    Labels

    bug (Something isn't working)
