Getting Started with Ray Serve LLM#
Now that we understand the fundamentals, let's see how to get started with Ray Serve LLM. The process involves four main steps:
Configure your LLM deployment
Deploy the service
Query the deployed model
Shutdown the deployment
Step 1: Configuration#
Let’s create a simple configuration:
# serve_llama.py
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    # Model loading configuration
    model_loading_config=dict(
        model_id="my-llama",  # custom name for the model
        model_source="unsloth/Meta-Llama-3.1-8B-Instruct",  # Hugging Face model repo
    ),
    accelerator_type="L4",  # accelerator type to use (must be available in your Ray cluster)
    # Optional: configure Ray Serve autoscaling
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,  # keep at least 1 replica up to avoid cold starts
            max_replicas=2,  # no more than 2 replicas to control cost
        )
    ),
    # Configure the vLLM engine. Follows the same API as vLLM engine arguments:
    # https://docs.vllm.ai/en/stable/configuration/engine_args.html
    engine_kwargs=dict(max_model_len=8192),
)

app = build_openai_app({"llm_configs": [llm_config]})
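Although this example registers a single model, build_openai_app accepts a list of configs, so the same pattern extends to serving several models behind one OpenAI-compatible endpoint. Below is a minimal sketch; the second model id and Hugging Face repo are illustrative placeholders, and you would need enough GPU capacity in your cluster for both models.

# Hypothetical second model; swap in any repo your cluster can host.
qwen_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen",
        model_source="Qwen/Qwen2.5-7B-Instruct",
    ),
    accelerator_type="L4",
    engine_kwargs=dict(max_model_len=8192),
)

multi_model_app = build_openai_app({"llm_configs": [llm_config, qwen_config]})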
Step 2: Deployment#
Deployment can be done locally or on Anyscale Services:
Local Deployment:
!serve run serve_llama:app --non-blocking
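Before querying, you can confirm that the application has started with the Ray Serve CLI status command, which reports the state of each application and deployment:

!serve status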
Anyscale Services:
To deploy your LLM as an Anyscale Service, specify the image and compute configuration and point to your Serve application:
# service.yaml
name: deploy-llama-3-8b
image_uri: anyscale/ray-llm:2.49.0-py311-cu128  # Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to use a custom Dockerfile.
compute_config:
  auto_select_worker_config: true
working_dir: .
cloud:
applications:
  # Point to your app in your Python module
  - import_path: serve_llama:app
Deploy your service:
!anyscale service deploy -f service.yaml
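Deployment takes a few minutes while the cluster starts and the model weights download. You can check progress from the Anyscale CLI; the command below assumes the service name from service.yaml and that your CLI version supports selecting a service by name with -n:

!anyscale service status -n deploy-llama-3-8b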
Step 3: Querying#
Once deployed, you can use the OpenAI Python client with base_url pointing to your Ray Serve endpoint.
from openai import OpenAI
from urllib.parse import urljoin

# Because the service is deployed locally, use localhost:8000 and a dummy placeholder API key
base_url = "http://localhost:8000"
token = "DUMMY_KEY"

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=token)

response = client.chat.completions.create(
    model="my-llama",
    messages=[
        {"role": "user", "content": "What's the capital of France?"}
    ],
    stream=True,
)

# Stream the response and print the generated text as it arrives
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
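Streaming is optional. The same client handles regular requests, in which case the full completion is returned at once on message rather than delta. A short sketch against the same local endpoint:

# Non-streaming request: the whole completion comes back in a single response object.
response = client.chat.completions.create(
    model="my-llama",
    messages=[{"role": "user", "content": "Name three French cities."}],
)
print(response.choices[0].message.content)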
Step 4: Shutdown#
Shut down a local deployment:
!serve shutdown -y
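If you manage Serve from Python rather than the CLI, ray.serve exposes an equivalent call; a minimal sketch:

from ray import serve

# Tears down all Serve applications and the Serve controller on this cluster.
serve.shutdown()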
Terminate an Anyscale service:
!anyscale service terminate -n deploy-llama-3-8b