1. When to Consider Ray Data

Consider using Ray Data for your project if it meets one or more of the following criteria:

| Challenge | Details | Ray Data Solution |
| --- | --- | --- |
| Operating on large datasets and/or models | • Loading and processing massive datasets or models (e.g., >10 TB)<br>• Performing inference with large models (e.g., LLMs) using inference engines | • Distributes data loading and processing across a Ray cluster<br>• Supports large model inference workloads via ray.data.llm |
| Efficient hardware utilization across CPUs and GPUs | • Over-provisioning compute to naively partition data<br>• Performing static resource allocation<br>• Running each CPU and GPU stage to completion before starting the next<br>• Passing data between heterogeneous stages by persisting intermediate results to disk | • Streams data to avoid full materialization in memory<br>• Enables resource multiplexing across pipeline stages<br>• Supports autoscaling for both CPU and GPU resources<br>• Enables pipeline parallelism across heterogeneous hardware with configurable batch sizes |
| Building reliable pipelines | • Handling failures such as network errors, spot instance preemptions, and hardware faults | • Leverages Ray Core's fault-tolerance mechanisms to recover from failed tasks<br>• Supports driver checkpointing (via RayTurbo) for end-to-end pipeline reliability |
| Handling unstructured data efficiently | • Suboptimal resource allocation due to data skew in input sizes (e.g., varying input video lengths) | • Automatically reshapes data into uniformly sized blocks to improve processing efficiency |

An example batch inference pipeline with Ray Data over a large dataset, using a heterogeneous cluster of CPUs and GPUs.