RFC: LLM APIs for Ray Data and Ray Serve
Summary
This RFC proposes new APIs for working with Large Language Models (LLMs) in the Ray ecosystem, specifically integrations of Ray Data and Ray Serve with vLLM and OpenAI-compatible endpoints.
Motivation
As LLMs become increasingly central to modern AI infrastructure, platforms need to deploy and scale these models efficiently. Today, Ray Data and Ray Serve have only limited support for LLM workloads, and users have to manually configure and manage the underlying LLM engine.
This proposal aims to address these challenges by providing unified, production-ready APIs for both batch processing and serving of LLMs within Ray: ray.data.llm and ray.serve.llm.
Ray Data LLM
ray.data.llm introduces several key components:
- build_llm_processor: Unified API for constructing processors
- Processors: User-facing APIs for specific LLM workloads, including vLLM engines and OpenAI-compatible HTTP endpoints
- ProcessorConfig: Configuration interface for Processors
Design Principles:
- Integrate seamlessly with existing Ray Data APIs
- One processor contains at most one LLM engine
- Configurable but with sensible defaults for optimal throughput
import ray
from ray.data.llm import build_llm_processor, VLLMProcessorConfig
processor_config = VLLMProcessorConfig(
model="meta-llama/Llama-3.1-8B-Instruct",
)
processor = build_llm_processor(
processor_config,
preprocess=lambda row: dict(
messages=row["question"],
sampling_params=dict(
temperature=0.3,
max_tokens=250,
)
),
postprocess=lambda row: dict(
answer=row["generated_text"]
),
concurrency=4,
)
ds = ray.data.read_parquet(...)
ds = processor(ds)
ds.write_parquet(...)
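The defaults are tuned for out-of-the-box throughput, but the processor config is meant to remain tunable. The sketch below is illustrative only: engine_kwargs mirrors its use in the RAG example later in this RFC and is assumed to be forwarded to the vLLM engine, and the specific vLLM options shown are examples rather than requirements.
from ray.data.llm import VLLMProcessorConfig
# Illustrative sketch: engine_kwargs is assumed to be passed through to vLLM.
tuned_config = VLLMProcessorConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs=dict(
        max_model_len=8192,      # cap context length to reduce KV-cache pressure
        tensor_parallel_size=1,  # single-GPU replicas; increase for larger models
    ),
)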
You can also make calls to deployed models that expose an OpenAI-compatible API endpoint.
from ray.data.llm import build_llm_processor, HTTPRequestProcessorConfig
OPENAI_KEY = "..."
ds = ray.data.read_parquet("...")
ds = build_llm_processor(
HTTPRequestProcessorConfig(
url="https://api.openai.com/v1/chat/completions",
header=f"Authorization: Bearer {OPENAI_KEY}",
qps=1,
),
preprocess=lambda row: dict(
model="gpt-4o-mini",
messages=row["messages"],
sampling_params=dict(
temperature=0.0,
max_tokens=150,
),
),
postprocess=lambda row: dict(
resp=row["generated_text"]
),
concurrency=8,
)(ds)
ds.write_parquet("...")
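The HTTP processor is not tied to OpenAI itself; any OpenAI-compatible endpoint should work, including a self-hosted ray.serve.llm application like the ones described below. In this sketch the URL and qps value are placeholders.
from ray.data.llm import HTTPRequestProcessorConfig
# Placeholder URL for a self-hosted OpenAI-compatible endpoint (for example a
# ray.serve.llm application); no Authorization header is needed in that case.
self_hosted_config = HTTPRequestProcessorConfig(
    url="http://localhost:8000/v1/chat/completions",
    qps=10,
)
# The rest of the pipeline is identical to the OpenAI example above.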
Ray Serve LLM
The new ray.serve.llm module provides:
- VLLMDeployment: Manages VLLM engine deployment
- LLMModelRouterDeployment: OpenAI-compatible API router
- LLMConfig: Unified configuration for model deployment
- LoRA Support: Multi-adapter sharing with LRU caching
These components let users deploy multiple LLMs side by side with the familiar Ray Serve API while remaining compatible with the OpenAI API.
from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig, ModelLoadingConfig
from ray.serve import DeploymentConfig, AutoscalingConfig
llm_config = LLMConfig(
model_loading_config=ModelLoadingConfig(
served_model_name="llama-3.1-8b",
model_source="meta-llama/Llama-3.1-8b-instruct",
),
deployment_config=DeploymentConfig(
autoscaling_config=AutoscalingConfig(
min_replicas=1,
max_replicas=8,
)
),
)
vllm_deployment = VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)
serve.run(vllm_deployment)
Below is a more comprehensive example that serves multiple models behind an OpenAI-compatible API with Ray Serve.
from ray import serve
from ray.serve.llm import LLMModelRouterDeployment, VLLMDeployment, LLMConfig, ModelLoadingConfig
from ray.serve import DeploymentConfig, AutoscalingConfig
# Configure multiple models
llm_config1 = LLMConfig(
model_loading_config=ModelLoadingConfig(
served_model_name="llama-3.1-8b",
model_source="meta-llama/Llama-3.1-8b-instruct",
),
deployment_config=DeploymentConfig(
autoscaling_config=AutoscalingConfig(
min_replicas=1,
max_replicas=8,
)
),
)
llm_config2 = LLMConfig(
model_loading_config=ModelLoadingConfig(
served_model_name="llama-3.2-3b",
model_source="meta-llama/Llama-3.2-3b-instruct",
),
deployment_config=DeploymentConfig(
autoscaling_config=AutoscalingConfig(
min_replicas=1,
max_replicas=8,
)
),
)
# Create deployments
vllm_deployment1 = VLLMDeployment.options(**llm_config1.get_serve_options()).bind(llm_config1)
vllm_deployment2 = VLLMDeployment.options(**llm_config2.get_serve_options()).bind(llm_config2)
# Create router deployment
llm_app = LLMModelRouterDeployment.bind([vllm_deployment1, vllm_deployment2])
# Deploy the application
serve.run(llm_app)
You can then use an OpenAI client to interact with the deployed models.
from openai import OpenAI
# Initialize client with your deployment endpoint
client = OpenAI(
base_url="http://localhost:8000",
api_key="fake-key" # The API key is not validated
)
# Chat completion
chat_response = client.chat.completions.create(
model="llama-3.1-8b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
temperature=0.7,
max_tokens=150
)
# Text completion
completion_response = client.completions.create(
model="llama-3.2-3b",
prompt="Write a poem about",
temperature=0.7,
max_tokens=150
)
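Because the router speaks the OpenAI protocol, standard client features such as streaming should carry over. The sketch below continues with the client created above and assumes the router implements streamed responses.
# Streaming chat completion (assumes the router supports streamed responses)
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain Ray Serve in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)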
Advanced Example: RAG Pipeline with ray.data.llm
Here is a more complex example that demonstrates how to build a RAG pipeline using the new ray.data.llm APIs.
import ray
from ray.data.llm import (
build_llm_processor,
VLLMProcessorConfig,
HTTPRequestProcessorConfig,
)
# Embedding generation
embed_processor_config = VLLMProcessorConfig(
model="google-bert/bert-base-uncased",
task_type="embed",
)
# Vector DB querying
retrieve_processor_config = HTTPRequestProcessorConfig(
url="http://vector-db-endpoint",
header="...",
qps=1,
)
# LLM processing
llm_processor_config = VLLMProcessorConfig(
model="meta-llama/Llama-3.1-70B-Instruct",
engine_kwargs=dict(pipeline_parallel_size=2),
)
# Pipeline construction
ds = ray.data.read_parquet("...")
# Generate embeddings
ds = build_llm_processor(
embed_processor_config,
preprocess=lambda row: dict(prompt=row["question"]),
postprocess=lambda row: dict(embedding=row["embedding"]),
concurrency=2,
)(ds)
# Query vector DB
ds = build_llm_processor(
retrieve_processor_config,
preprocess=lambda row: dict(body=row["embedding"]),
postprocess=lambda row: dict(retrieved=row["text"]),
concurrency=4,
)(ds)
# Generate answers
ds = build_llm_processor(
llm_processor_config,
preprocess=lambda row: dict(
messages=[
{"role": "system", "content": "..."},
{"role": "user", "content": f"{row['question']}\n{row['retrieved']}"},
],
sampling_params=dict(temperature=0.3, max_tokens=250)
),
concurrency=4,
)(ds)
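The output of the last processor is still an ordinary Ray Dataset, so the pipeline finishes with a regular sink (output path elided as in the other examples).
# Persist the generated answers
ds.write_parquet("...")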
Advanced Example: LoRA serving with ray.serve.llm
from ray import serve
from ray.serve.llm import VLLMDeployment, LLMConfig, ModelLoadingConfig, LoraConfig
from ray.serve import DeploymentConfig, AutoscalingConfig
# Configure the LLM with LoRA support
llm_config = LLMConfig(
model_loading_config=ModelLoadingConfig(
model_source="meta-llama/Llama-3.1-8b-instruct",
served_model_name="llama-3.1-8b"
),
lora_config=LoraConfig(
# Path containing all LoRA adapters
dynamic_lora_loading_path="s3://my-bucket/llama-loras/",
# Maximum number of LoRA adapters that can share a single base model
max_num_adapters_per_replica=4,
),
deployment_config=DeploymentConfig(
autoscaling_config=AutoscalingConfig(
min_replicas=1,
max_replicas=4
)
)
)
# Create and deploy the model
vllm_deployment = VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)
serve.run(vllm_deployment)
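A client then selects an adapter at request time. The sketch below assumes the deployment is fronted by the OpenAI-compatible router from the earlier examples and that adapters are addressed as "<served_model_name>:<adapter_name>"; both the routing setup and the naming scheme are illustrative assumptions, with adapters resolved on demand from dynamic_lora_loading_path and cached per replica.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000", api_key="fake-key")
# "llama-3.1-8b:my-adapter" is an assumed naming scheme for the LoRA adapter
# stored under s3://my-bucket/llama-loras/my-adapter.
response = client.chat.completions.create(
    model="llama-3.1-8b:my-adapter",
    messages=[{"role": "user", "content": "Hello!"}],
)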
ray.serve.llm config-based API
There is also a configuration-based API for serving LLMs, where
the configurations can be declared separately from the application logic.
from pydantic import BaseModel, Field
from typing import List, Union
from ray.serve.llm import LLMModelRouterDeployment, VLLMDeployment, LLMConfig
def build_vllm_deployment(llm_config: LLMConfig):
    return VLLMDeployment.options(**llm_config.get_serve_options()).bind(llm_config)
class LLMServingArgs(BaseModel):
llm_configs: List[Union[str, LLMConfig]] = Field(
description="A list of LLMConfigs, or paths to LLMConfigs, to run.",
)
def build_openai_app(llm_serving_args: LLMServingArgs):
llm_configs = llm_serving_args.llm_configs
llm_deployments = []
for llm_config in llm_configs:
if isinstance(llm_config, str):
llm_config = LLMConfig.from_yaml(llm_config)
llm_deployments.append(build_vllm_deployment(llm_config))
return LLMModelRouterDeployment.bind(llm_deployments)
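The builder can also be invoked programmatically; the sketch below reuses llm_config1 and llm_config2 from the multi-model example above.
# Programmatic use of the builder with inline configs
app = build_openai_app(LLMServingArgs(llm_configs=[llm_config1, llm_config2]))
serve.run(app)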
Sample config.yaml
# Demonstrate inline llm configs in the Serve config
applications:
- name: llm_app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
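A config file in this form can then be deployed with the existing Ray Serve tooling (for example, serve deploy config.yaml or serve run config.yaml), keeping model configuration separate from application code.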
Future Work
- Support for additional LLM inference engines beyond vLLM
- Enhanced monitoring and observability
- Advanced batching and scheduling optimizations
- Additional processor types for specialized workflows