Setting up Ray Serve LLM#

Ray Serve LLM provides multiple Python APIs for defining your application. The main abstractions we’ll work with are:

Key Components#

  1. LLMConfig: Configuration object that defines your model, hardware, and deployment settings

  2. build_openai_app: Public function that creates an OpenAI-compatible application from your configuration

  3. Ray Serve: The underlying orchestration layer that handles scaling and load balancing (see the sketch after this list)
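
At a glance, these pieces form a build-then-deploy flow: an LLMConfig describes one model deployment, build_openai_app turns one or more configs into a Ray Serve application, and Ray Serve runs and scales it. The following minimal sketch shows only the flow; the model and settings are placeholders, not a tuned configuration:

# Minimal sketch: config -> app -> running deployment
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config=dict(
        model_id="my-model",  # Name exposed through the API
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # Small ungated placeholder model
    ),
)

# build_openai_app returns a Serve application exposing OpenAI-compatible routes
app = build_openai_app({"llm_configs": [config]})

# Ray Serve deploys the application and handles routing, scaling, and load balancing
serve.run(app, blocking=True)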

Configuration for Medium-Sized Models#

For medium-sized models, we need to:

  • Set an appropriate accelerator_type for your target hardware

  • Configure tensor parallelism with tensor_parallel_size to match the number of GPUs available to each replica

Let’s create our configuration:

# serve_llama_3_1_70b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        # Or unsloth/Meta-Llama-3.1-70B-Instruct for an ungated model
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=4,
        )
    ),
    accelerator_type="L40S", # Or another GPU with similar VRAM, such as "A100-40G"
    # Type `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>` in a terminal
    runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        max_model_len=32768, # See the model's Hugging Face card for its max context length
        # Split the model weights across 8 GPUs on the node
        tensor_parallel_size=8,
    ),
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [llm_config]})
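
Because the app exposes the standard OpenAI routes, any OpenAI client can query it using the model_id defined above. A quick smoke test, assuming the openai Python package is installed and the app is running locally on Serve's default HTTP port 8000 (the api_key value is an arbitrary placeholder since no authentication is configured here):

# query_llama.py -- hypothetical client-side smoke test
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # Serve's default local address
    api_key="not-needed",  # Placeholder; no auth is configured in this example
)

response = client.chat.completions.create(
    model="my-llama-3.1-70b",  # Must match model_id in the LLMConfig
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)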

Configuration Breakdown#

Let’s understand each part of our configuration:

Model Loading:

  • model_id: Unique identifier for your model in the API

  • model_source: Hugging Face model path (gated models require an HF token)

  • HF_TOKEN: Hugging Face token for accessing gated models (see the check sketched below)
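
Because the token is only forwarded through runtime_env, a missing HF_TOKEN typically surfaces later as a download failure inside the engine. An optional guard (our own addition, not required by Ray) makes the failure immediate and obvious:

import os

# Fail fast with a clear message instead of a confusing download error later
if not os.environ.get("HF_TOKEN"):
    raise RuntimeError(
        "HF_TOKEN is not set. Run `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>` "
        "before starting the app, or point model_source at an ungated model."
    )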

Hardware Configuration:

  • accelerator_type: GPU type (L40S, A100-40G, etc.)

  • tensor_parallel_size: Number of GPUs to split the model across (see the sizing sketch below)
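
The right tensor_parallel_size depends on per-GPU memory: the 70B model needs roughly 140 GB just for its 16-bit weights, plus headroom for the KV cache, which is why the configuration above spreads it across 8 L40S (48 GB) GPUs. As a hedged sketch, on 80 GB GPUs a smaller split is usually enough; treat this variant as a starting point, not a tuned configuration:

# Hypothetical variant for 80 GB GPUs (e.g. A100-80G): fewer GPUs per replica
llm_config_a100_80g = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    accelerator_type="A100-80G",
    engine_kwargs=dict(
        max_model_len=32768,
        # ~140 GB of 16-bit weights split across 4 x 80 GB leaves room for the KV cache
        tensor_parallel_size=4,
    ),
)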

Deployment Settings:

  • autoscaling_config: Min/max replicas for horizontal scaling (a sketch with additional knobs follows below)
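
Beyond the replica bounds, Serve's autoscaler decides when to add or remove replicas based on how much load each replica is handling. A hedged sketch of a more explicit autoscaling_config (parameter names follow recent Ray Serve releases; check the AutoscalingConfig reference for your version):

# Hypothetical autoscaling settings with an explicit per-replica load target
deployment_config = dict(
    autoscaling_config=dict(
        min_replicas=1,
        max_replicas=4,
        # Scale out when the average number of in-flight requests per replica exceeds this
        target_ongoing_requests=32,
    )
)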

Monitoring:

  • log_engine_metrics: Expose LLM-specific metrics (Time to First Token, Time Per Output Token, Requests Per Second, …)