Description
What happened + What you expected to happen
When serving an LLM (Qwen3-32B) using `LLMServer.as_deployment`, we noticed intermittent health check timeouts causing replica(s) to restart from time to time (roughly every half hour). No error message was thrown in the logs. Serving the same model with `vllm serve` and the same configs worked without any issue (we were using the vLLM v0 engine).
The Ray Serve config we used was:

```yaml
deployment_config:
  num_replicas: 2
engine_kwargs:
  tensor_parallel_size: 4
  dtype: bfloat16
  enforce_eager: True
  gpu_memory_utilization: 0.8
  max_model_len: 32768
  enable_prefix_caching: True
```
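As a possible workaround we are considering relaxing the controller's health-check thresholds on the deployment. This is only a sketch: `health_check_period_s` and `health_check_timeout_s` are standard Ray Serve deployment options, but the values below are guesses, not a confirmed fix for the underlying issue:

```yaml
deployment_config:
  num_replicas: 2
  # Assumed tuning, not verified: give replicas more headroom under
  # heavy prefill load before the controller marks them unhealthy.
  health_check_period_s: 30
  health_check_timeout_s: 60
```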
Error Log:
1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:52 [metrics.py:486] Avg prompt throughput: 27408.8 tokens/s, Avg generation throughput: 4.9 tokens/s, Running: 49 reqs, Swapped: 0 reqs, Pending: 951 reqs, GPU KV cache usage: 16.8%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:52 [metrics.py:502] Prefix cache hit rate: GPU: 12.55%, CPU: 0.00% [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:58 [metrics.py:486] Avg prompt throughput: 23591.0 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 63 reqs, Swapped: 0 reqs, Pending: 937 reqs, GPU KV cache usage: 23.9%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:58 [metrics.py:502] Prefix cache hit rate: GPU: 11.12%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:39:58,304 controller 1758 -- Didn't receive health check response for replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:01,246 controller 1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:03 [metrics.py:486] Avg prompt throughput: 33982.4 tokens/s, Avg generation throughput: 9.2 tokens/s, Running: 111 reqs, Swapped: 0 reqs, Pending: 889 reqs, GPU KV cache usage: 31.7%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:03 [metrics.py:502] Prefix cache hit rate: GPU: 15.37%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,493 controller 1758 -- Didn't receive health check response for replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,494 controller 1758 -- Replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed the health check 3 times in a row, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,494 controller 1758 -- Replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed health check, stopping it.
(ServeController pid=1758) INFO 2025-06-25 04:40:08,496 controller 1758 -- Adding 1 replica to Deployment(name='vLLM:Qwen--Qwen3-32B', app='RayServeApp').
(ServeController pid=1758) ERROR 2025-06-25 04:40:08,502 controller 1758 -- The deployments ['vLLM:Qwen--Qwen3-32B'] are UNHEALTHY.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:08 [metrics.py:486] Avg prompt throughput: 25758.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 127 reqs, Swapped: 0 reqs, Pending: 873 reqs, GPU KV cache usage: 39.2%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:08 [metrics.py:502] Prefix cache hit rate: GPU: 14.75%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,338 controller 1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,339 controller 1758 -- Replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed the health check 3 times in a row, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,339 controller 1758 -- Replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed health check, stopping it.
(ServeController pid=1758) INFO 2025-06-25 04:40:11,340 controller 1758 -- Adding 1 replica to Deployment(name='vLLM:Qwen--Qwen3-32B', app='RayServeApp').
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:14 [metrics.py:486] Avg prompt throughput: 27000.1 tokens/s, Avg generation throughput: 3.9 tokens/s, Running: 148 reqs, Swapped: 0 reqs, Pending: 852 reqs, GPU KV cache usage: 47.0%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:14 [metrics.py:502] Prefix cache hit rate: GPU: 14.16%, CPU: 0.00% [repeated 2x across cluster]
Versions / Dependencies
```
ray==2.46.0
vllm==0.8.5
```
Reproduction script
We created the LLM server handles as follows (note: we used the `LLMServer` class directly, without `LLMRouter`):

```python
from ray.serve.llm import LLMConfig, LLMServer

llm_handles = {}
for llm_config in llm_configs:
    llm_config = LLMConfig.model_validate(llm_config)
    llm_server_handle = LLMServer.as_deployment(
        llm_config.get_serve_options(name_prefix="vLLM:")
    ).bind(llm_config)
    llm_handles[llm_config.model_id] = llm_server_handle
```

Each `llm_server_handle` is used later as:

```python
async for response in llm_server_handle.chat.remote(chat_request):
    ...
```
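For completeness, the streaming consumption pattern above (the handle's `chat.remote` being iterated as an async generator) can be reproduced stand-alone with a stub in place of the real handle — purely illustrative, no Ray required, and the stub names are our own, not the Ray Serve API:

```python
import asyncio

# Stub standing in for llm_server_handle.chat.remote(chat_request);
# the real handle streams response chunks as an async generator.
async def fake_chat_stream(chat_request):
    for chunk in ["He", "llo"]:
        yield chunk

async def consume(chat_request):
    parts = []
    async for response in fake_chat_stream(chat_request):
        parts.append(response)
    return "".join(parts)

print(asyncio.run(consume({"messages": []})))  # prints: Hello
```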
Issue Severity
Medium: It is a significant difficulty but I can work around it.