Deploying to Anyscale Services#

For production deployment, we’ll use Anyscale Services to deploy our Ray Serve app to a dedicated cluster. The great news: no code changes are needed, and we can use the exact same LLM configuration!

What is an Anyscale Service?#

An Anyscale Service is a managed deployment that provides:

  • Dedicated Infrastructure: Your own Ray cluster in the cloud

  • Automatic Scaling: Handles traffic spikes and load balancing

  • Fault Tolerance: Recovers from node failures and supports zero-downtime rolling updates

  • Enterprise Features: Security, monitoring, and compliance

Setting up the Configuration File#

Let’s create the service configuration:

# service.yaml
name: deploy-llama-3-70b
image_uri: anyscale/ray-llm:2.49.0-py311-cu128 # Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to use a custom Dockerfile.
compute_config:
  auto_select_worker_config: true 
working_dir: .
cloud: # (Optional) Anyscale cloud to deploy to. Leave empty to use your default cloud.
applications:
  # Point to your app in your Python module
  - import_path: serve_llama_3_1_70b:app
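The `import_path` above points to an `app` object defined in serve_llama_3_1_70b.py. For reference, here is a minimal sketch of what that module might look like, assuming the `ray.serve.llm` API; the model source, accelerator type, and engine settings are illustrative assumptions, so keep whatever values you used for your local deployment:

# serve_llama_3_1_70b.py (illustrative sketch)
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",  # the name clients pass as `model` in API calls
        model_source="meta-llama/Llama-3.1-70B-Instruct",  # assumed Hugging Face source
    ),
    accelerator_type="A100-80G",  # assumption: pick a GPU type available in your cloud
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    engine_kwargs=dict(tensor_parallel_size=8),  # assumption: a 70B model spans multiple GPUs
)

app = build_openai_app({"llm_configs": [llm_config]})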

Launching the Service#

Now let’s deploy our service to Anyscale:

!anyscale service deploy -f service.yaml --env HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
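Deployment takes a few minutes while the cluster starts and the model weights are downloaded. Assuming the standard Anyscale CLI, you can check rollout progress with:

!anyscale service status -n deploy-llama-3-70b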

Running Inference on Anyscale#

Once deployed, you’ll receive a service endpoint URL and an authentication token. Let’s use them to query the model with the OpenAI client:

from openai import OpenAI

client = OpenAI(
    # Replace both values with the base URL and token from your own deployment
    base_url="https://deploy-llama-3-70b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1",
    api_key="2YKUt_IJZ8q8GWT5VPHVitzsHKsddoL6mSszJxzwe5A",
)

response = client.chat.completions.create(
    model="my-llama-3.1-70b",  # must match the model_id configured in the Serve app
    messages=[{"role": "user", "content": "Tell me about Anyscale!"}],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
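As a quick sanity check, you can list the model IDs the service exposes using the same client; the `model` argument in the request above must match one of them:

# List the models served by the endpoint
for model in client.models.list():
    print(model.id)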

Shutting Down the Service#

When you’re done, terminate the service to tear down the cluster and stop incurring charges:

!anyscale service terminate -n deploy-llama-3-70b