Ray Serve LLM + Anyscale Architecture#

The following diagram shows how Ray Serve LLM + Anyscale provides a production-grade solution for your LLM deployment:

Notes:

  • The diagram shows only one replica per model, but Ray Serve can easily scale each model to multiple replicas.

Ray Serve LLM + Anyscale provides a production-grade solution through three integrated components:

1. Ray Serve for Orchestration#

Ray Serve handles the orchestration and scaling of your LLM deployment:

  • Automatic scaling: Adds/removes model replicas based on traffic

  • Load balancing: Distributes requests across available replicas

  • Unified multi-model deployment: Deploy and manage multiple models

  • OpenAI-compatible API: Drop-in replacement for OpenAI clients

The following diagram shows how Ray Serve LLM handles a client’s request.
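To make the orchestration concrete, here is a minimal deployment sketch using the `ray.serve.llm` API. The model ID, accelerator type, and replica bounds are illustrative placeholders; check the Ray Serve LLM documentation for the exact options available in your Ray version.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Illustrative config: model ID, accelerator type, and replica bounds are placeholders.
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",                       # name exposed through the API
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # Hugging Face model to load
    },
    accelerator_type="A10G",
    deployment_config={
        # Ray Serve adds/removes replicas within these bounds based on traffic.
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    },
)

# Build an OpenAI-compatible app that routes requests to the model's replicas.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

Because the endpoint is OpenAI-compatible, any standard OpenAI client can query it by pointing its `base_url` at the Serve HTTP address, for example:

```python
from openai import OpenAI

# The API key is required by the client but unused by the local Serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")
response = client.chat.completions.create(
    model="qwen-0.5b",  # matches the model_id defined above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```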

2. vLLM as the Inference Engine#

LLM inference is a non-trivial problem that requires tuning both low-level hardware usage and high-level scheduling algorithms. An inference engine abstracts this complexity and optimizes model execution. Ray Serve LLM natively integrates vLLM as its inference engine for several reasons:

  • Optimized CUDA kernels: Fast GPU computation with kernels specifically tuned for LLM inference.

  • Continuous batching: Continuously schedules tokens for processing to maximize GPU utilization.

  • Smart memory use: Optimizes memory usage with state-of-the-art algorithms such as PagedAttention.

Ray Serve LLM gives you a high degree of flexibility in configuring your vLLM engine (more on that later), as sketched below.
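As a rough sketch of that flexibility, engine options can be passed through `engine_kwargs` in the `LLMConfig`. The specific values below (context length, tensor parallelism, memory fraction, prefill behavior) are illustrative assumptions, not recommended settings for any particular model or GPU.

```python
from ray.serve.llm import LLMConfig

# Illustrative only: these vLLM engine arguments show what can be tuned,
# not recommended values for a specific model or hardware setup.
llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",
    },
    accelerator_type="A10G",
    engine_kwargs={
        "max_model_len": 8192,           # cap the context window to bound KV-cache memory
        "tensor_parallel_size": 1,       # shard the model across this many GPUs per replica
        "gpu_memory_utilization": 0.90,  # fraction of GPU memory vLLM may claim
        "enable_chunked_prefill": True,  # interleave prefill and decode for smoother batching
    },
)
```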

3. Anyscale for Infrastructure#

Anyscale provides managed infrastructure and enterprise features:

  • Managed infrastructure: Optimized Ray clusters in your cloud

  • Cost optimization: Pay-as-you-go, scale-to-zero

  • Enterprise security: VPC, SSO, audit logs

  • Seamless scaling: Handle traffic spikes automatically