Getting Started with Ray Serve LLM

Now that we understand the fundamentals, let's see how to get started with Ray Serve LLM. The process involves four main steps:

  1. Configure your LLM deployment

  2. Deploy the service

  3. Query the deployed model

  4. Shutdown the deployment

Step 1: Configuration

Let’s create a simple configuration:

# serve_llama.py
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    # Model loading configuration
    model_loading_config=dict(
        model_id="my-llama", # custom name for the model
        model_source="unsloth/Meta-Llama-3.1-8B-Instruct", # huggingface model repo
    ),
    accelerator_type="L4", # device to use (picked from your ray cluster)
    ## Optional: configure Ray Serve autoscaling
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, # keep at least 1 replica up to avoid cold starts
            max_replicas=2, # no more than 2 replicas to control cost
        )
    ),
    # Configure your vLLM engine. Follow the same API as vLLM
    # https://docs.vllm.ai/en/stable/configuration/engine_args.html
    engine_kwargs=dict(max_model_len=8192),
)

app = build_openai_app({"llm_configs": [llm_config]})
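
build_openai_app accepts a list of LLM configs, so a single app can serve several models behind the same OpenAI-compatible endpoint. A minimal sketch, assuming a second, hypothetical Qwen deployment (the model_id and Hugging Face source below are placeholders you can swap for any model you want to serve):

# Hypothetical second model; reuse the same config options as above
qwen_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen",
        model_source="Qwen/Qwen2.5-7B-Instruct",
    ),
    accelerator_type="L4",
    engine_kwargs=dict(max_model_len=8192),
)

# Clients choose a model per request via the "model" field ("my-llama" or "my-qwen")
app = build_openai_app({"llm_configs": [llm_config, qwen_config]})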

Step 2: Deployment

You can deploy the app locally or as an Anyscale Service:

Local Deployment:

!serve run serve_llama:app --non-blocking
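
The --non-blocking flag returns control to your shell while the application starts in the background. You can confirm the deployment is up and healthy with the Ray Serve CLI:

!serve status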

Anyscale Services:

To deploy your LLM as an Anyscale Service, create a service config that specifies your cloud and compute configuration and points to your LLM application:

# service.yaml
name: deploy-llama-3-8b
image_uri: anyscale/ray-llm:2.49.0-py311-cu128 # Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to use a custom Dockerfile.
compute_config:
  auto_select_worker_config: true 
working_dir: .
cloud: # optional: name of the Anyscale cloud to deploy to
applications:
  # Point to your app in your Python module
  - import_path: serve_llama:app

Deploy your service:

!anyscale service deploy -f service.yaml
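
Provisioning compute and downloading the model weights can take several minutes. One way to check on the rollout is the Anyscale CLI's service status command (exact flags may vary by CLI version; the name matches the name field in service.yaml):

!anyscale service status -n deploy-llama-3-8b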

Step 3: Querying

Once deployed, you can use the OpenAI Python client by pointing base_url at your Ray Serve endpoint and passing the model_id you configured (my-llama) as the model name.

from openai import OpenAI
from urllib.parse import urljoin

# For a local deployment, use localhost:8000 and a dummy placeholder API key
# (for an Anyscale Service, use the service URL and bearer token instead)
base_url = "http://localhost:8000"
token = "DUMMY_KEY"
client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=token)

response = client.chat.completions.create(
    model="my-llama",
    messages=[
        {"role": "user", "content": "What's the capital of France?"}
    ],
    stream=True
)

# Stream and print the model's response as it is generated
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
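
If you don't need token-by-token streaming, the same request works synchronously and returns the full message once generation finishes:

# Non-streaming variant: the complete response is returned in one object
response = client.chat.completions.create(
    model="my-llama",
    messages=[
        {"role": "user", "content": "What's the capital of France?"}
    ],
    stream=False,
)
print(response.choices[0].message.content)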

Step 4: Shutdown

Shut down the local deployment:

!serve shutdown -y

Terminate an Anyscale service:

!anyscale service terminate -n deploy-llama-3-8b