Description
What happened + What you expected to happen
I have configured Ray Serve 2.46.0 (via KubeRay) to deploy meta-llama/Llama-3.3-70B-Instruct. Unfortunately, I cannot get the model to load, even though the same model loads successfully in a KubeAI-based setup where vLLM connects to an external Ray cluster (Ray 2.43.0, also via KubeRay).
I expected the model to load successfully. Instead, the deployment failed, unlike the Gemma 3 and DeepSeek models we have also been testing. The stock Llama 3.3 repository on Hugging Face does not contain a processor config, which appears to be why the AutoProcessor.from_pretrained call in the stack trace below fails (a minimal transformers-only reproduction is included after the trace).
Here is the stack trace:
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 984, in initialize_and_get_metadata
await self._replica_impl.initialize(deployment_config)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 713, in initialize
raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 690, in initialize
self._user_callable_asgi_app = await asyncio.wrap_future(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1384, in initialize_callable
await self._call_func_or_gen(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1347, in _call_func_or_gen
result = await result
^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 440, in __init__
await asyncio.wait_for(self._start_engine(), timeout=ENGINE_START_TIMEOUT_S)
File "/home/ray/anaconda3/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 486, in _start_engine
await self.engine.start()
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 232, in start
self.engine = await self._start_engine()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 272, in _start_engine
return await self._start_engine_v1()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 348, in _start_engine_v1
) = await self._prepare_engine_config(use_v1=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 287, in _prepare_engine_config
node_initialization = await self.initialize_node(self.llm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 218, in initialize_node
return await initialize_node_util(llm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/deployments/utils/node_initialization_utils.py", line 123, in initialize_node
llm_config.apply_checkpoint_info(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/configs/server_models.py", line 279, in apply_checkpoint_info
self._prompt_format.set_processor(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/serve/configs/prompt_formats.py", line 142, in set_processor
self._processor = transformers.AutoProcessor.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/transformers/models/auto/processing_auto.py", line 375, in from_pretrained
raise ValueError(
ValueError: Unrecognized processing class in meta-llama/Llama-3.3-70B-Instruct. Can't instantiate a processor, a tokenizer, an image processor or a feature extractor for this model. Make sure the repository contains the files of at least one of those processing classes.
Status: DEPLOY_FAILED
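For reference, the underlying error can be reproduced outside of Ray Serve with transformers alone. The snippet below is my minimal approximation of the failing call in prompt_formats.py (set_processor), not Ray's exact code path; it assumes an HF_TOKEN with access to the gated meta-llama repository, and the exact behaviour may vary with the installed transformers version.

# Minimal approximation (not Ray's exact code path) of the call that fails in
# ray/llm/_internal/serve/configs/prompt_formats.py::set_processor.
# Assumes HF_TOKEN is set and grants access to the gated meta-llama repo.
import transformers

model_source = "meta-llama/Llama-3.3-70B-Instruct"

try:
    processor = transformers.AutoProcessor.from_pretrained(model_source)
    print(f"Loaded processor: {type(processor).__name__}")
except ValueError as exc:
    # In my deployment this is where replica initialization dies (see the
    # trace above): "Unrecognized processing class in meta-llama/Llama-3.3-70B-Instruct ..."
    print(f"AutoProcessor failed: {exc}")

# The tokenizer loads fine, which is consistent with the same model working
# when vLLM is pointed at the repo directly (as in our KubeAI setup).
tokenizer = transformers.AutoTokenizer.from_pretrained(model_source)
print(f"Loaded tokenizer: {type(tokenizer).__name__}")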
Versions / Dependencies
Ray Serve - 2.46.0
Ray LLM - rayproject/ray-llm:2.46.0-py311-cu124
Reproduction script
Here is the Ray Serve configuration. Note that it works fine with the Gemma 3 and DeepSeek models we are also using; I've removed those from the configuration to keep only the relevant parts.
I am deploying this to Azure on an H100-based data plane, but that isn't important here: everything needed to reproduce the issue is in serveConfigV2 (a Python sketch of the same application follows the manifest below).
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-catalogue
spec:
  serveConfigV2: |
    applications:
      - args:
          llm_configs:
            - model_loading_config:
                model_id: llama-3-3-70b-instruct
                model_source: meta-llama/Llama-3.3-70B-Instruct
              accelerator_type: H100
              deployment_config:
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 1
              engine_kwargs:
                dtype: auto
                enable_chunked_prefill: true
                enable_prefix_caching: true
                gpu_memory_utilization: 0.95
                pipeline_parallel_size: 2
                tensor_parallel_size: 2
                trust_remote_code: true
              runtime_env:
                env_vars:
                  VLLM_USE_V1: "1"
        import_path: ray.serve.llm:build_openai_app
        name: llm-catalogue
        route_prefix: "/"
  rayClusterConfig:
    rayVersion: 2.46.0
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
        spec:
          affinity: {}
          containers:
            - name: ray-head
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: huggingface
                      key: token
              image: rayproject/ray-llm:2.46.0-py311-cu124
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8000
                  name: serve
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: 2
                  memory: 8G
                requests:
                  cpu: 1
                  memory: 4G
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
          nodeSelector:
            agentpool: system
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      - groupName: h100x2
        maxReplicas: 4
        minReplicas: 2
        rayStartParams: {}
        template:
          metadata:
            annotations:
              prometheus.io/scrape: "true"
          spec:
            affinity: {}
            containers:
              - name: ray-worker
                env:
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: huggingface
                        key: token
                image: rayproject/ray-llm:2.46.0-py311-cu124
                resources:
                  limits:
                    cpu: 80
                    memory: 640Gi
                    nvidia.com/gpu: 2
                  requests:
                    cpu: 40
                    memory: 384Gi
                    nvidia.com/gpu: 2
                volumeMounts:
                  - mountPath: /home/ray/.cache/huggingface/hub
                    name: huggingface-cache
            nodeSelector:
              agentpool: h100x2
            securityContext: {}
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
              - key: kubernetes.azure.com/scalesetpriority
                operator: Equal
                value: spot
                effect: NoSchedule
            volumes:
              - name: huggingface-cache
                persistentVolumeClaim:
                  claimName: huggingface-cache
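For completeness, here is a sketch of the same application expressed with the ray.serve.llm Python API, in case it helps reproduce the issue without KubeRay. This is my approximation of the serveConfigV2 above based on the Ray Serve LLM docs, not a verified deployment script; it assumes a running Ray 2.46.0 cluster with the H100 workers defined above.

# Sketch of the serveConfigV2 application above using the ray.serve.llm
# Python API (Ray 2.46.0). Approximation for reproduction only; the YAML
# manifest above is the configuration actually deployed.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-3-70b-instruct",
        model_source="meta-llama/Llama-3.3-70B-Instruct",
    ),
    accelerator_type="H100",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    engine_kwargs=dict(
        dtype="auto",
        enable_chunked_prefill=True,
        enable_prefix_caching=True,
        gpu_memory_utilization=0.95,
        pipeline_parallel_size=2,
        tensor_parallel_size=2,
        trust_remote_code=True,
    ),
    runtime_env=dict(env_vars={"VLLM_USE_V1": "1"}),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)  # replica initialization fails with the ValueError above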
This is the PVC definition for the Hugging Face model cache:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: huggingface-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 750Gi
  storageClassName: azurefile-csi
Issue Severity
Medium: It is a significant difficulty but I can work around it.