Ray Serve

Ray Serve is a highly scalable and flexible model serving library for building online inference APIs. It lets you:

  • Wrap models and business logic as separate Serve deployments and connect them together (pipeline, ensemble, etc.), as in the composition sketch after this list.

  • Avoid a single monolithic service that is network- and compute-bound and uses resources inefficiently.

  • Utilize fractional heterogeneous resources, which isn’t possible with SageMaker, Vertex, KServe, etc., and horizontally scale with num_replicas (sketch below).

  • Autoscale up and down based on traffic (sketch below).

  • Integrate with FastAPI and HTTP (sketch below).

  • Set up a gRPC service to build distributed systems and microservices.

  • Enable dynamic batching based on batch size, time, etc. (sketch below).

  • Access a suite of utilities for serving LLMs that are inference-engine agnostic and come with batteries-included support for LLM-specific features such as multi-LoRA (sketch below).
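
A minimal sketch of the deployment-and-composition pattern, assuming a recent Ray release; the Preprocessor/Model classes and their toy logic are illustrative placeholders, not part of Ray Serve:

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    def run(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Model:
    def __init__(self, preprocessor: DeploymentHandle):
        self.preprocessor = preprocessor

    async def __call__(self, request) -> dict:
        text = (await request.json())["text"]
        cleaned = await self.preprocessor.run.remote(text)
        return {"prediction": len(cleaned)}  # stand-in for real inference

# Compose the two deployments into one application and serve it over HTTP.
app = Model.bind(Preprocessor.bind())
serve.run(app)
```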
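
Fractional resources are requested through ray_actor_options on a deployment. A sketch where two replicas share each GPU; the values and class body are illustrative:

```python
from ray import serve

# Four replicas, each claiming half a GPU, so two replicas pack onto one GPU.
@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 0.5})
class HalfGPUModel:
    def __call__(self, request) -> str:
        return "ok"  # placeholder for real inference

serve.run(HalfGPUModel.bind())
```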
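
Autoscaling is configured per deployment. A sketch with illustrative thresholds; field names follow recent Ray releases:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Add replicas once average in-flight requests per replica exceeds
        # this target; remove them as traffic drops.
        "target_ongoing_requests": 2,
    }
)
class AutoscaledModel:
    async def __call__(self, request) -> str:
        return "ok"  # placeholder for real inference

serve.run(AutoscaledModel.bind())
```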
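
The FastAPI integration wraps an ordinary FastAPI app in a deployment, so its routes are served through Serve's HTTP proxy with scaling for free. A minimal sketch; the route and handler are placeholders:

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)  # routes defined below are handled by this deployment
class APIIngress:
    @app.get("/hello")
    def hello(self, name: str = "world") -> dict:
        return {"greeting": f"Hello, {name}!"}

serve.run(APIIngress.bind())
# Try it: curl "http://localhost:8000/hello?name=Serve"
```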
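
Dynamic batching groups requests that arrive within a short window into one vectorized call. A sketch with illustrative thresholds:

```python
from ray import serve

@serve.deployment
class BatchedModel:
    # Waits up to 0.1 s to collect up to 8 requests, then handles them
    # together; both thresholds are illustrative.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs: list[str]) -> list[int]:
        return [len(text) for text in inputs]  # one pass over the batch

    async def __call__(self, request) -> int:
        text = (await request.json())["text"]
        # Called once per request; Serve batches across concurrent callers.
        return await self.handle_batch(text)

serve.run(BatchedModel.bind())
```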
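
The LLM utilities live in ray.serve.llm in recent Ray releases. A sketch that serves a small Hugging Face model behind an OpenAI-compatible endpoint; the model ID, source, and replica counts are placeholders for your own setup:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "my-llm",                          # name clients request
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # weights to load
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
)

# Exposes OpenAI-compatible routes such as /v1/chat/completions.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```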

🔥 RayTurbo Serve on Anyscale adds more functionality on top of Ray Serve:

  • Fast autoscaling and model loading to get services up and running faster, with up to 5x improvements even for LLMs.

  • 54% higher QPS and up to 3x the streaming tokens per second for high-traffic serving use cases, with no proxy bottlenecks.

  • Replica compaction onto fewer nodes where possible to reduce resource fragmentation and improve hardware utilization.

  • Zero-downtime incremental rollouts so your service is never interrupted.

  • Different environments for each service in a multi-serve application.

  • Availability-zone-aware scheduling of Ray Serve replicas to provide higher redundancy against availability-zone failures.