# Deploying to Anyscale Services
For production deployment, we’ll use Anyscale Services to deploy our Ray Serve app to a dedicated cluster. The great news is that no code changes are needed: we can reuse the exact same LLM configuration.
## What is an Anyscale Service?
An Anyscale Service is a managed deployment that provides:

- **Dedicated infrastructure**: your own Ray cluster in the cloud
- **Automatic scaling**: load balancing and autoscaling to absorb traffic spikes
- **Fault tolerance**: resilience to node failures and support for rolling updates
- **Enterprise features**: security, monitoring, and compliance
## Setting up the Configuration File
Let’s create the service configuration:
```yaml
# service.yaml
name: deploy-llama-3-70b
# Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to build from a custom Dockerfile instead.
image_uri: anyscale/ray-llm:2.49.0-py311-cu128
compute_config:
  auto_select_worker_config: true
working_dir: .
cloud: # (optional) name of the Anyscale cloud to deploy into; omit to use your default
applications:
  # Point to your app in your Python module
  - import_path: serve_llama_3_1_70b:app
```
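The `import_path` points at the Serve app we built earlier. As a reference point, here is a minimal sketch of what a `serve_llama_3_1_70b.py` module can look like using Ray Serve LLM’s `LLMConfig` and `build_openai_app`; the model source, accelerator type, and parallelism values below are illustrative placeholders, so adjust them to your own setup:

```python
# serve_llama_3_1_70b.py -- minimal sketch; values are illustrative.
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llama-3.1-70b",  # the name clients pass as `model`
        model_source="meta-llama/Llama-3.1-70B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    accelerator_type="A100-80G",  # illustrative; pick a GPU type available in your cloud
    engine_kwargs=dict(tensor_parallel_size=8),  # illustrative; shards the 70B model across GPUs
)

# `app` is the object service.yaml's import_path refers to.
app = build_openai_app({"llm_configs": [llm_config]})
```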
## Launching the Service
Now let’s deploy our service to Anyscale:
```bash
!anyscale service deploy -f service.yaml --env HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>
```
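Deployment takes a few minutes while Anyscale provisions the cluster and rolls out the app. If you want to check on progress from the CLI, `anyscale service status` reports the current state (the flag usage mirrors the terminate command at the end of this section):

```bash
!anyscale service status -n deploy-llama-3-70b
```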
## Running Inference on Anyscale
Once deployed, you’ll get an endpoint and authentication token. Let’s see how to use them:
```python
from openai import OpenAI

# The endpoint URL and token come from the `anyscale service deploy` output.
client = OpenAI(
    base_url="https://deploy-llama-3-70b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1",
    api_key="2YKUt_IJZ8q8GWT5VPHVitzsHKsddoL6mSszJxzwe5A",
)

response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me about Anyscale!"}],
    stream=True,
)

# Stream tokens to stdout as they arrive.
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```
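Since the endpoint is OpenAI-compatible, a non-streaming call works the same way; this snippet reuses the `client` defined above:

```python
# Non-streaming variant: the full completion arrives in one response object.
response = client.chat.completions.create(
    model="my-llama-3.1-70b",
    messages=[{"role": "user", "content": "Tell me about Anyscale!"}],
)
print(response.choices[0].message.content)
```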
## Shutting Down the Service
When you’re done with your service, shut it down to release its dedicated cluster:
```bash
!anyscale service terminate -n deploy-llama-3-70b
```