[RFC] LLM APIs for Ray Data and Ray Serve #50639
Closed
@richardliaw


RFC: LLM APIs for Ray Data and Ray Serve

Summary

This RFC proposes new APIs in Ray for leveraging Large Language Models (LLMs) effectively within the Ray ecosystem, specifically introducing integrations for Ray Serve and Ray Data with vLLM and OpenAI.

Motivation

As LLMs become increasingly central to modern AI infrastructure, platforms need to deploy and scale these models efficiently. Ray Data and Ray Serve currently have only limited support for LLM workloads: users must manually configure and manage the underlying LLM engine.

This proposal addresses these challenges by providing unified, production-ready APIs for both batch processing and serving of LLMs within Ray, under ray.data.llm and ray.serve.llm.

Ray Data LLM

ray.data.llm introduces several key components:

  1. build_llm_processor: Unified API for constructing processors
  2. Processors: User-facing API for specific LLM functionalities, including integration with vLLM and endpoint-based deployments
  3. ProcessorConfig: Configuration interface for Processors

Design Principles:

  • Integrate seamlessly with existing Ray Data APIs
  • One processor contains at most one LLM engine
  • Configurable but with sensible defaults for optimal throughput
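
For example, offline batch inference with a vLLM-backed processor would look like the following:
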
import ray
from ray.data.llm import build_llm_processor, VLLMProcessorConfig

processor_config = VLLMProcessorConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
)

processor = build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["question"]}],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        )
    ),
    postprocess=lambda row: dict(
        answer=row["generated_text"]
    ),
    concurrency=4,
)

ds = ray.data.read_parquet(...)
ds = processor(ds)
ds.write_parquet(...)

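Here, preprocess maps each input row to the request fields consumed by the engine (messages and sampling_params), postprocess maps the generated output back into dataset columns, and concurrency controls how many processor replicas run in parallel.
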
You can also send requests to deployed models that expose an OpenAI-compatible API endpoint.

from ray.data.llm import HTTPRequestProcessorConfig

OPENAI_KEY = "..."
ds = ray.data.read_parquet("...")

ds = build_llm_processor(
    HTTPRequestProcessorConfig(
        url="https://api.openai.com/v1/chat/completions",
        header=f"Authorization: Bearer {OPENAI_KEY}",
        qps=1,
    ),
    preprocess=lambda row: dict(
        model="gpt-4o-mini",
        messages=row["messages"],
        sampling_params=dict(
            temperature=0.0,
            max_tokens=150,
        ),
    ),
    postprocess=lambda row: dict(
        resp=row["generated_text"]
    ),
    concurrency=8,
)(ds)

ds.write_parquet("...")
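
For endpoint-based processors, qps throttles the request rate against the external API, while concurrency, as above, controls how many workers issue requests in parallel.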

Ray Serve LLM

The new ray.serve.llm provides:

  1. VLLMDeployment: Manages a vLLM engine deployment
  2. LLMModelRouterDeployment: OpenAI-compatible API router across model deployments
  3. LLMConfig: Unified configuration for model deployment
  4. LoRA Support: Multi-adapter sharing with LRU caching

These features allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.

from ray import serve
from ray.serve import DeploymentConfig, AutoscalingConfig
from ray.serve.llm import VLLMDeployment, LLMConfig, ModelLoadingConfig

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

vllm_deployment = VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)
serve.run(vllm_deployment)

Below is a more comprehensive example of using the OpenAI API with Ray Serve.

from ray import serve
from ray.serve import DeploymentConfig, AutoscalingConfig
from ray.serve.llm import LLMModelRouterDeployment, VLLMDeployment, LLMConfig, ModelLoadingConfig

# Configure multiple models
llm_config1 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

llm_config2 = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        served_model_name="llama-3.2-3b",
        model_source="meta-llama/Llama-3.2-3B-Instruct",
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=8,
        )
    ),
)

# Create deployments
vllm_deployment1 = VLLMDeployment.options(**llm_config1.get_serve_options()).bind(llm_config1)
vllm_deployment2 = VLLMDeployment.options(**llm_config2.get_serve_options()).bind(llm_config2)

# Create router deployment
llm_app = LLMModelRouterDeployment.options().bind([vllm_deployment1, vllm_deployment2])

# Deploy the application
serve.run(llm_app)

And you can now use an OpenAI API client to interact with the deployed models.

from openai import OpenAI

# Initialize client with your deployment endpoint
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="fake-key"  # The API key is not validated
)

# Chat completion
chat_response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=150
)

# Text completion
completion_response = client.completions.create(
    model="llama-3.2-3b",
    prompt="Write a poem about",
    temperature=0.7,
    max_tokens=150
)

Advanced Example: RAG Pipeline with ray.data.llm

Here is a more complex example that demonstrates how to build a RAG pipeline using
the new ray.data.llm APIs.

import ray
from ray.data.llm import (
    build_llm_processor,
    VLLMProcessorConfig,
    HTTPRequestProcessorConfig,
)

# Embedding generation
embed_processor_config = VLLMProcessorConfig(
    model="google-bert/bert-base-uncased",
    task_type="embed",
)

# Vector DB querying
retrieve_processor_config = HTTPRequestProcessorConfig(
    url="http://vector-db-endpoint",
    header="...",
    qps=1,
)

# LLM processing
llm_processor_config = VLLMProcessorConfig(
    model="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs=dict(pipeline_parallel_size=2),
)

# Pipeline construction
ds = ray.data.read_parquet("...")

# Generate embeddings
ds = build_llm_processor(
    embed_processor_config,
    preprocess=lambda row: dict(prompt=row["question"]),
    postprocess=lambda row: dict(embedding=row["embedding"]),
    concurrency=2,
)(ds)

# Query vector DB
ds = build_llm_processor(
    retrieve_processor_config,
    preprocess=lambda row: dict(body=row["embedding"]),
    postprocess=lambda row: dict(retrieved=row["text"]),
    concurrency=4,
)(ds)

# Generate answers
ds = build_llm_processor(
    llm_processor_config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "..."},
            {"role": "user", "content": f"{row['question']}\n{row['retrieved']}"},
        ],
        sampling_params=dict(temperature=0.3, max_tokens=250)
    ),
    concurrency=4,
)(ds)
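
As in the earlier examples, the resulting dataset can then be written out (for example with ds.write_parquet(...)) or consumed by any downstream Ray Data operation.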

Advanced Example: LoRA serving with ray.serve.llm

from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig, ModelLoadingConfig, LoraConfig
from ray.serve import DeploymentConfig, AutoscalingConfig

# Configure the LLM with LoRA support
llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_source="meta-llama/Llama-3.1-8B-Instruct",
        served_model_name="llama-3.1-8b"
    ),
    lora_config=LoraConfig(
        # Path containing all LoRA adapters
        dynamic_lora_loading_path="s3://my-bucket/llama-loras/",
        # Maximum number of LoRA adapters that can share a single base model
        max_num_adapters_per_replica=4,
    ),
    deployment_config=DeploymentConfig(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=4
        )
    )
)

# Create and deploy the model
vllm_deployment = VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)
serve.run(vllm_deployment)
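
Individual adapters would then be selected at request time through the model name. The exact addressing scheme is not specified in this RFC; the base-model:adapter-id form and the my_adapter name below are assumptions for illustration only.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Hypothetical: target a LoRA adapter stored under
# s3://my-bucket/llama-loras/my_adapter/ as "<served_model_name>:<adapter_id>"
response = client.chat.completions.create(
    model="llama-3.1-8b:my_adapter",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
)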

ray.serve.llm config-based API

There is also a configuration-based API for serving LLMs, where
the configurations can be declared separately from the application logic.

from pydantic import BaseModel, Field
from typing import List, Union
from ray.serve.llm import VLLMDeployment, LLMModelRouterDeployment, LLMConfig

def build_vllm_deployment(llm_config: LLMConfig):
    return VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)


class LLMServingArgs(BaseModel):
    llm_configs: List[Union[str, LLMConfig]] = Field(
        description="A list of LLMConfigs, or paths to LLMConfigs, to run.",
    )


def build_openai_app(llm_serving_args: LLMServingArgs):
    llm_configs = llm_serving_args.llm_configs
    llm_deployments = []
    for llm_config in llm_configs:
        if isinstance(llm_config, str):
            llm_config = LLMConfig.from_yaml(llm_config)
        llm_deployments.append(build_vllm_deployment(llm_config))
    return LLMModelRouterDeployment.bind(llm_deployments)

Sample config.yaml

# Demonstrate inline llm configs in the Serve config
applications:
- name: llm_app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
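
Such a config file could then be deployed with the standard Serve CLI (for example, serve run config.yaml or serve deploy config.yaml), keeping model configuration separate from application code.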

Future Work

  1. Support for additional LLM inference engines beyond vLLM
  2. Enhanced monitoring and observability
  3. Advanced batching and scheduling optimizations
  4. Additional processor types for specialized workflows

cc @comaniac @kouroshHakha @akshay-anyscale @gvspraveen
