1. When to Consider Ray Data

Consider using Ray Data for your project if it meets one or more of the following criteria:

| Challenge | Details | Ray Data Solution |
| --- | --- | --- |
| Operating on large datasets and/or models | • Loading and processing massive datasets or models (e.g., >10 TB)<br>• Performing inference with large models (e.g., LLMs) using inference engines | • Distributes data loading and processing across a Ray cluster<br>• Supports large model inference workloads via ray.data.llm |
| Efficient hardware utilization across CPUs and GPUs | • Over-provisioning compute to naively partition data<br>• Performing static resource allocation<br>• Running each CPU and GPU stage to completion before starting the next<br>• Passing data between heterogeneous stages by persisting intermediate results to disk | • Streams data to avoid full materialization in memory<br>• Enables resource multiplexing across pipeline stages<br>• Supports autoscaling for both CPU and GPU resources<br>• Enables pipeline parallelism across heterogeneous hardware with configurable batch sizes |
| Building reliable pipelines | • Handling failures such as network errors, spot instance preemptions, and hardware faults | • Leverages Ray Core's fault-tolerance mechanisms to recover from failed tasks<br>• Supports driver checkpointing (via RayTurbo) for end-to-end pipeline reliability |
| Handling unstructured data efficiently | • Suboptimal resource allocation due to data skew in input sizes (e.g., varying input video lengths) | • Automatically reshapes data into uniformly sized blocks to improve processing efficiency |

An example batch inference pipeline with Ray Data over a large dataset, using a heterogeneous cluster of CPUs and GPUs.