Description
What happened + What you expected to happen
When serving an LLM (Qwen3-32B) using `LLMServer.as_deployment`, we noticed intermittent health check timeouts causing replica(s) to restart from time to time (roughly every half hour). No error message was thrown in the logs. Serving the same model with `vllm serve` and the same configs worked without any issue (we were using the vLLM v0 engine).
The Ray Serve config we used was:

```yaml
deployment_config:
  num_replicas: 2
engine_kwargs:
  tensor_parallel_size: 4
  dtype: bfloat16
  enforce_eager: True
  gpu_memory_utilization: 0.8
  max_model_len: 32768
  enable_prefix_caching: True
```
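As a possible workaround we are considering relaxing the controller's health-check thresholds on the deployment. This is only a sketch: `health_check_period_s` and `health_check_timeout_s` are standard Ray Serve deployment options, but the values below are guesses, not a confirmed fix for the underlying issue:

```yaml
deployment_config:
  num_replicas: 2
  # Assumed tuning, not verified: give replicas more headroom under
  # heavy prefill load before the controller marks them unhealthy.
  health_check_period_s: 30
  health_check_timeout_s: 60
```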
Error Log:
1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:52 [metrics.py:486] Avg prompt throughput: 27408.8 tokens/s, Avg generation throughput: 4.9 tokens/s, Running: 49 reqs, Swapped: 0 reqs, Pending: 951 reqs, GPU KV cache usage: 16.8%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:52 [metrics.py:502] Prefix cache hit rate: GPU: 12.55%, CPU: 0.00% [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:58 [metrics.py:486] Avg prompt throughput: 23591.0 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 63 reqs, Swapped: 0 reqs, Pending: 937 reqs, GPU KV cache usage: 23.9%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:39:58 [metrics.py:502] Prefix cache hit rate: GPU: 11.12%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:39:58,304 controller 1758 -- Didn't receive health check response for replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:01,246 controller 1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:03 [metrics.py:486] Avg prompt throughput: 33982.4 tokens/s, Avg generation throughput: 9.2 tokens/s, Running: 111 reqs, Swapped: 0 reqs, Pending: 889 reqs, GPU KV cache usage: 31.7%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:03 [metrics.py:502] Prefix cache hit rate: GPU: 15.37%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,493 controller 1758 -- Didn't receive health check response for replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,494 controller 1758 -- Replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed the health check 3 times in a row, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:08,494 controller 1758 -- Replica Replica(id='6t84bxez', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed health check, stopping it.
(ServeController pid=1758) INFO 2025-06-25 04:40:08,496 controller 1758 -- Adding 1 replica to Deployment(name='vLLM:Qwen--Qwen3-32B', app='RayServeApp').
(ServeController pid=1758) ERROR 2025-06-25 04:40:08,502 controller 1758 -- The deployments ['vLLM:Qwen--Qwen3-32B'] are UNHEALTHY.
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:08 [metrics.py:486] Avg prompt throughput: 25758.0 tokens/s, Avg generation throughput: 2.9 tokens/s, Running: 127 reqs, Swapped: 0 reqs, Pending: 873 reqs, GPU KV cache usage: 39.2%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:08 [metrics.py:502] Prefix cache hit rate: GPU: 14.75%, CPU: 0.00% [repeated 2x across cluster]
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,338 controller 1758 -- Didn't receive health check response for replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') after 10.0s, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,339 controller 1758 -- Replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed the health check 3 times in a row, marking it unhealthy.
(ServeController pid=1758) WARNING 2025-06-25 04:40:11,339 controller 1758 -- Replica Replica(id='rut1591v', deployment='vLLM:Qwen--Qwen3-32B', app='RayServeApp') failed health check, stopping it.
(ServeController pid=1758) INFO 2025-06-25 04:40:11,340 controller 1758 -- Adding 1 replica to Deployment(name='vLLM:Qwen--Qwen3-32B', app='RayServeApp').
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:14 [metrics.py:486] Avg prompt throughput: 27000.1 tokens/s, Avg generation throughput: 3.9 tokens/s, Running: 148 reqs, Swapped: 0 reqs, Pending: 852 reqs, GPU KV cache usage: 47.0%, CPU KV cache usage: 0.0%. [repeated 2x across cluster]
(_EngineBackgroundProcess pid=194388) INFO 06-25 04:40:14 [metrics.py:502] Prefix cache hit rate: GPU: 14.16%, CPU: 0.00% [repeated 2x across cluster]
Versions / Dependencies
```
ray==2.46.0
vllm==0.8.5
```
Reproduction script
We created the LLM server handles as follows (note: we used the `LLMServer` class directly, without `LLMRouter`):

```python
from ray.serve.llm import LLMConfig, LLMServer

llm_handles = {}
for llm_config in llm_configs:
    llm_config = LLMConfig.model_validate(llm_config)
    llm_server_handle = LLMServer.as_deployment(
        llm_config.get_serve_options(name_prefix="vLLM:")
    ).bind(llm_config)
    llm_handles[llm_config.model_id] = llm_server_handle
```

Each `llm_server_handle` is used later as:

```python
async for response in llm_server_handle.chat.remote(chat_request):
    ...
```
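For completeness, the streaming consumption pattern above (the handle's `chat.remote` being iterated as an async generator) can be reproduced stand-alone with a stub in place of the real handle — purely illustrative, no Ray required, and the stub names are our own, not the Ray Serve API:

```python
import asyncio

# Stub standing in for llm_server_handle.chat.remote(chat_request);
# the real handle streams response chunks as an async generator.
async def fake_chat_stream(chat_request):
    for chunk in ["He", "llo"]:
        yield chunk

async def consume(chat_request):
    parts = []
    async for response in fake_chat_stream(chat_request):
        parts.append(response)
    return "".join(parts)

print(asyncio.run(consume({"messages": []})))  # prints: Hello
```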
Issue Severity
Medium: It is a significant difficulty but I can work around it.