Setting up Ray Serve LLM#
Ray Serve LLM provides multiple Python APIs for defining your application. The main abstractions we’ll work with are:
Key Components#
- LLMConfig: Configuration object that defines your model, hardware, and deployment settings
- build_openai_app: Public function that creates an OpenAI-compatible application from your configuration
- Ray Serve: The underlying orchestration layer that handles scaling and load balancing
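To see how these pieces fit together, here's a minimal sketch (the small Qwen model and file name are arbitrary examples, not part of this guide's deployment): an LLMConfig describes the model, build_openai_app wraps it in an OpenAI-compatible app, and Ray Serve runs it.

# minimal_serve_llm.py -- illustrative sketch only
from ray.serve.llm import LLMConfig, build_openai_app

# LLMConfig: describes the model, hardware, and deployment settings
config = LLMConfig(
    model_loading_config=dict(
        model_id="my-small-model",                  # name clients use in API calls
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # any Hugging Face model path
    ),
)

# build_openai_app: turns one or more LLMConfigs into an OpenAI-compatible app
app = build_openai_app({"llm_configs": [config]})

# Ray Serve: runs the app (for example with `serve run minimal_serve_llm:app`)
# and handles scaling and load balancing across replicas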
Configuration for Medium-Sized Models#
For medium-sized models, we need to:
- Set an appropriate accelerator_type for the hardware
- Configure tensor parallelism with tensor_parallel_size to match the number of GPUs
Let’s create our configuration:
# serve_llama_3_1_70b.py
import os

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        # Or unsloth/Meta-Llama-3.1-70B-Instruct for an ungated model
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        )
    ),
    accelerator_type="L40S",  # Or a GPU with similar VRAM, such as "A100-40G"
    # Type `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>` in a terminal
    runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        max_model_len=32768,  # See the model's Hugging Face card for the max context length
        # Split weights among 8 GPUs in the node
        tensor_parallel_size=8,
    ),
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [llm_config]})
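Once the file is saved, one way to try it out (a sketch; your cluster's deployment flow may differ) is to start the app with serve run serve_llama_3_1_70b:app and query the OpenAI-compatible endpoint, which Serve exposes on port 8000 by default:

# query_llama.py -- assumes the app is running locally on the default Serve port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

response = client.chat.completions.create(
    model="my-llama-3.1-70b",  # must match model_id in the LLMConfig
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)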
Configuration Breakdown#
Let’s understand each part of our configuration:
Model Loading:
- model_id: Unique identifier for your model in the API
- model_source: Hugging Face model path (gated models require an HF token)
- HF_TOKEN: Hugging Face token for accessing gated models
Hardware Configuration:
- accelerator_type: GPU type (L40S, A100-40G, etc.)
- tensor_parallel_size: Number of GPUs to split the model across
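These two settings work together: each replica spans tensor_parallel_size GPUs of the given accelerator_type, and their combined memory must hold the model weights plus KV cache. As an illustrative sketch (the A100-80G variant and its sizing are assumptions to verify, not a tested configuration), the same model could run on fewer, larger GPUs:

from ray.serve.llm import LLMConfig

# Hypothetical variant: Llama 3.1 70B on 4x A100-80G instead of 8x L40S
llm_config_a100 = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="A100-80G",  # Ray schedules each replica on this GPU type
    engine_kwargs=dict(
        max_model_len=32768,
        # ~140 GB of FP16 weights split across 4x80 GB leaves room for KV cache,
        # but verify headroom for your context length and batch sizes
        tensor_parallel_size=4,
    ),
)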
Deployment Settings:
- autoscaling_config: Min/max replicas for horizontal scaling
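Ray Serve's autoscaler accepts more knobs than min/max replicas; here's a hedged sketch of a fuller autoscaling_config (the field names come from Ray Serve's autoscaling settings, but the values are illustrative, not recommendations):

# Illustrative deployment_config with extra autoscaling fields
deployment_config = dict(
    autoscaling_config=dict(
        min_replicas=1,               # keep at least one replica warm
        max_replicas=4,               # cap the number of GPU replicas
        target_ongoing_requests=32,   # per-replica load target that triggers scaling
        upscale_delay_s=30,           # how long load must stay high before scaling up
        downscale_delay_s=300,        # scale down more slowly to avoid thrashing
    )
)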
Monitoring:
- log_engine_metrics: Display LLM-specific metrics (Time to First Token, Time Per Output Token, Requests Per Second, etc.)