Closed
Description
Hello, I was able to set up a Ray cluster with 16 GPUs for vLLM, and here is how it looks:
(vllm) <dayeol@node0> benchmarks ray list nodes
======== List: 2025-06-26 11:28:29.307494 ========
Stats:
------------------------------
Total: 2
Table:
------------------------------
NODE_ID NODE_IP IS_HEAD_NODE STATE STATE_MESSAGE NODE_NAME RESOURCES_TOTAL LABELS
0 05dc9fcb13c3975706b29e51c173fee8987d3d33c90603c4462c8a92 node0.internal True ALIVE node0.internal CPU: 224.0 ray.io/node_id: 05dc9fcb13c3975706b29e51c173fee8987d3d33c90603c4462c8a92
GPU: 8.0
accelerator_type:H100: 1.0
memory: 1477.507 GiB
node:__internal_head__: 1.0
node:node0.internal: 1.0
object_store_memory: 186.265 GiB
1 6808cee8a2bdc20ad1ec345a2dfe05712c3b80676bbd796e84b82d20 node1.internal False ALIVE node1.internal CPU: 224.0 ray.io/node_id: 6808cee8a2bdc20ad1ec345a2dfe05712c3b80676bbd796e84b82d20
GPU: 8.0
accelerator_type:H100: 1.0
memory: 1505.393 GiB
node:node1.internal: 1.0
object_store_memory: 186.265 GiB
Each node has only an IPv6 address, but the node resources are annotated with the nodes' domain names.
But when I run my workload, it tries to request {'node:2d27c:1f9f:d698:63e7:94a4:bcd2:b36a:1b3f': 0.001, 'GPU': 1.0}, where 2d27c:1f9f:d698:63e7:94a4:bcd2:b36a:1b3f is the IPv6 address of node0.internal.
This request cannot be fulfilled because of the mismatch: the cluster only advertises node:node0.internal, not a resource keyed by the IPv6 address.
So if I try this:
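import ray

# Request the node resource by IPv6 address: this actor never gets scheduled.
@ray.remote(resources={"node:2d27c:1f9f:d698:63e7:94a4:bcd2:b36a:1b3f": 0.01}, num_gpus=0.1)
class Actor1:
    def __init__(self):
        pass

actor1 = Actor1.remote()

# Request the node resource by domain name: this actor starts normally.
@ray.remote(resources={"node:node0.internal": 0.01}, num_gpus=0.1)
class Actor2:
    def __init__(self):
        pass

actor2 = Actor2.remote()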
I only see Actor2 in the output of ray list actors, and Actor1 fails to start with the following error:
(autoscaler +1m24s, ip=node0.internal) Error: No available node types can fulfill resource request {'CPU': 1.0, 'node:2d27c:1f9f:d698:63e7:94a4:bcd2:b36a:1b3f': 0.01, 'GPU': 0.1}. Add suitable node types to this cluster to resolve this issue.
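To make the mismatch concrete, here is a minimal sketch that compares a resource request against the RESOURCES_TOTAL of node0.internal as listed above. The helpers `node_resource_keys` and `can_fulfill` are hypothetical names written for this illustration, not part of Ray or vLLM, and the check is a deliberate simplification of what the Ray scheduler does (it only compares names and totals).

```python
def node_resource_keys(resources_total):
    """Return the per-node `node:<name>` resource keys, skipping the head marker."""
    return sorted(
        key
        for key in resources_total
        if key.startswith("node:") and key != "node:__internal_head__"
    )


def can_fulfill(request, resources_total):
    """True if every requested resource exists on the node in sufficient quantity.

    Simplified stand-in for the scheduler: a missing resource counts as 0.
    """
    return all(
        resources_total.get(name, 0.0) >= amount
        for name, amount in request.items()
    )


# RESOURCES_TOTAL for node0.internal, copied from the `ray list nodes` output
# (memory entries omitted for brevity).
node0 = {
    "CPU": 224.0,
    "GPU": 8.0,
    "accelerator_type:H100": 1.0,
    "node:__internal_head__": 1.0,
    "node:node0.internal": 1.0,
}

# The request my workload makes (keyed by IPv6 address) versus the same
# request keyed by the domain name the cluster actually advertises.
by_ip = {"node:2d27c:1f9f:d698:63e7:94a4:bcd2:b36a:1b3f": 0.001, "GPU": 1.0}
by_name = {"node:node0.internal": 0.001, "GPU": 1.0}

print(node_resource_keys(node0))   # ['node:node0.internal']
print(can_fulfill(by_ip, node0))   # False
print(can_fulfill(by_name, node0)) # True
```

On a live cluster, the same dictionary could presumably be read from the "Resources" field of each entry returned by `ray.nodes()` instead of being copied by hand; I have not verified that against this setup.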
What would be a quick fix for this issue?
I'm not sure whether this is a vLLM issue or a Ray issue.
Please help!