Description
What happened + What you expected to happen
I have configured Ray Serve 2.46.0 (via KubeRay) to deploy meta-llama/Llama-3.3-70B-Instruct. Unfortunately, I cannot get the model to load, even though the same model loads successfully in a KubeAI-based setup where vLLM connects to an external Ray cluster (Ray 2.43.0, also via KubeRay).
I expected the model to load successfully. Instead, the deployment failed, unlike the Gemma 3 and DeepSeek models we have also been testing. The stock Llama 3.3 repository on Hugging Face does not contain a processor config, which appears to be why the AutoProcessor.from_pretrained call in the stack trace below fails (a minimal transformers-only reproduction is included after the trace).
Here is the stack trace:
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 984, in initialize_and_get_metadata
await self._replica_impl.initialize(deployment_config)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 713, in initialize
raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 690, in initialize
self._user_callable_asgi_app = await asyncio.wrap_future(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1384, in initialize_callable
await self._call_func_or_gen(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1347, in _call_func_or_gen
result = await result
^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 440, in __init__
await asyncio.wait_for(self._start_engine(), timeout=ENGINE_START_TIMEOUT_S)
File "/home/ray/anaconda3/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 486, in _start_engine
await self.engine.start()
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 232, in start
self.engine = await self._start_engine()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 272, in _start_engine
return await self._start_engine_v1()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 348, in _start_engine_v1
) = await self._prepare_engine_config(use_v1=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 287, in _prepare_engine_config
node_initialization = await self.initialize_node(self.llm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 218, in initialize_node
return await initialize_node_util(llm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/utils/node_initialization_utils.py", line 123, in initialize_node
llm_config.apply_checkpoint_info(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/configs/server_models.py", line 279, in apply_checkpoint_info
self._prompt_format.set_processor(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/configs/prompt_formats.py", line 142, in set_processor
self._processor = transformers.AutoProcessor.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/transformers/models/auto/processing_auto.py", line 375, in from_pretrained
raise ValueError(
ValueError: Unrecognized processing class in meta-llama/Llama-3.3-70B-Instruct. Can't instantiate a processor, a tokenizer, an image processor or a feature extractor for this model. Make sure the repository contains the files of at least one of those processing classes.
Status: DEPLOY_FAILED
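For reference, the underlying error can be reproduced outside of Ray Serve with transformers alone. The snippet below is my minimal approximation of the failing call in prompt_formats.py (set_processor), not Ray's exact code path; it assumes an HF_TOKEN with access to the gated meta-llama repository, and the exact behaviour may vary with the installed transformers version.

# Minimal approximation (not Ray's exact code path) of the call that fails in
# ray/llm/_internal/serve/configs/prompt_formats.py::set_processor.
# Assumes HF_TOKEN is set and grants access to the gated meta-llama repo.
import transformers

model_source = "meta-llama/Llama-3.3-70B-Instruct"

try:
    processor = transformers.AutoProcessor.from_pretrained(model_source)
    print(f"Loaded processor: {type(processor).__name__}")
except ValueError as exc:
    # In my deployment this is where replica initialization dies (see the
    # trace above): "Unrecognized processing class in meta-llama/Llama-3.3-70B-Instruct ..."
    print(f"AutoProcessor failed: {exc}")

# The tokenizer loads fine, which is consistent with the same model working
# when vLLM is pointed at the repo directly (as in our KubeAI setup).
tokenizer = transformers.AutoTokenizer.from_pretrained(model_source)
print(f"Loaded tokenizer: {type(tokenizer).__name__}")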
Versions / Dependencies
Ray Serve - 2.46.0
Ray LLM - rayproject/ray-llm:2.46.0-py311-cu124
Reproduction script
Here is the Ray Serve configuration. Note that it works fine with the Gemma 3 and DeepSeek models we are also using; I've removed those from the configuration to keep only the relevant parts.
I am deploying this to Azure on an H100-based data plane, but that isn't important here: everything needed to reproduce the issue is in serveConfigV2 (a Python sketch of the same application follows the manifest below).
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-catalogue
spec:
  serveConfigV2: |
    applications:
      - args:
          llm_configs:
            - model_loading_config:
                model_id: llama-3-3-70b-instruct
                model_source: meta-llama/Llama-3.3-70B-Instruct
              accelerator_type: H100
              deployment_config:
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 1
              engine_kwargs:
                dtype: auto
                enable_chunked_prefill: true
                enable_prefix_caching: true
                gpu_memory_utilization: 0.95
                pipeline_parallel_size: 2
                tensor_parallel_size: 2
                trust_remote_code: true
              runtime_env:
                env_vars:
                  VLLM_USE_V1: "1"
        import_path: ray.serve.llm:build_openai_app
        name: llm-catalogue
        route_prefix: "/"
  rayClusterConfig:
    rayVersion: 2.46.0
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
        spec:
          affinity: {}
          containers:
            - name: ray-head
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: huggingface
                      key: token
              image: rayproject/ray-llm:2.46.0-py311-cu124
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8000
                  name: serve
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: 2
                  memory: 8G
                requests:
                  cpu: 1
                  memory: 4G
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
          nodeSelector:
            agentpool: system
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      - groupName: h100x2
        maxReplicas: 4
        minReplicas: 2
        rayStartParams: {}
        template:
          metadata:
            annotations:
              prometheus.io/scrape: "true"
          spec:
            affinity: {}
            containers:
              - name: ray-worker
                env:
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: huggingface
                        key: token
                image: rayproject/ray-llm:2.46.0-py311-cu124
                resources:
                  limits:
                    cpu: 80
                    memory: 640Gi
                    nvidia.com/gpu: 2
                  requests:
                    cpu: 40
                    memory: 384Gi
                    nvidia.com/gpu: 2
                volumeMounts:
                  - mountPath: /home/ray/.cache/huggingface/hub
                    name: huggingface-cache
            nodeSelector:
              agentpool: h100x2
            securityContext: {}
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
              - key: kubernetes.azure.com/scalesetpriority
                operator: Equal
                value: spot
                effect: NoSchedule
            volumes:
              - name: huggingface-cache
                persistentVolumeClaim:
                  claimName: huggingface-cache
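For completeness, here is a sketch of the same application expressed with the ray.serve.llm Python API, in case it helps reproduce the issue without KubeRay. This is my approximation of the serveConfigV2 above based on the Ray Serve LLM docs, not a verified deployment script; it assumes a running Ray 2.46.0 cluster with the H100 workers defined above.

# Sketch of the serveConfigV2 application above using the ray.serve.llm
# Python API (Ray 2.46.0). Approximation for reproduction only; the YAML
# manifest above is the configuration actually deployed.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-3-70b-instruct",
        model_source="meta-llama/Llama-3.3-70B-Instruct",
    ),
    accelerator_type="H100",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        dtype="auto",
        enable_chunked_prefill=True,
        enable_prefix_caching=True,
        gpu_memory_utilization=0.95,
        pipeline_parallel_size=2,
        tensor_parallel_size=2,
        trust_remote_code=True,
    ),
    runtime_env=dict(env_vars={"VLLM_USE_V1": "1"}),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)  # replica initialization fails with the ValueError above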
This is the PVC definition for the Hugging Face model cache:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: huggingface-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 750Gi
  storageClassName: azurefile-csi
Issue Severity
Medium: It is a significant difficulty but I can work around it.