1. When to Consider Ray Data
Consider using Ray Data for your project if it meets one or more of the following criteria:
| Challenge | Details | Ray Data Solution |
|---|---|---|
| Operating on large datasets and/or models | Having to load and process massive datasets or models (e.g., >10 TB) | Distributes data loading and processing across a Ray cluster |
| Efficient hardware utilization across CPUs and GPUs | Over-provisioning compute to naively partition data | Streams data to avoid full materialization in memory |
| Building reliable pipelines | Needing to handle failures such as network errors, spot instance preemptions, and hardware faults | Leverages Ray Core's fault-tolerance mechanisms to recover from failed tasks |
| Handling unstructured data efficiently | Suboptimal resource allocation due to skew in input data sizes (e.g., varying input video lengths) | Automatically reshapes data into uniformly sized blocks to improve processing efficiency |
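
To make the "streams data to avoid full materialization" idea concrete, here is a minimal plain-Python sketch of block-wise streaming. It is a conceptual illustration only, not Ray Data's API or implementation: the names `read_blocks`, `transform`, and `stream_pipeline` are hypothetical, and the doubling step is a stand-in for real preprocessing or inference.

```python
from typing import Iterator, List


def read_blocks(num_rows: int, block_size: int) -> Iterator[List[int]]:
    """Lazily yield blocks of row ids instead of building one giant list."""
    for start in range(0, num_rows, block_size):
        yield list(range(start, min(start + block_size, num_rows)))


def transform(block: List[int]) -> List[int]:
    """Hypothetical per-block processing step (stand-in for real work)."""
    return [row * 2 for row in block]


def stream_pipeline(num_rows: int, block_size: int) -> Iterator[List[int]]:
    """Chain read and transform so only one block is held at a time."""
    for block in read_blocks(num_rows, block_size):
        yield transform(block)


# Consume the pipeline block by block; the full dataset is never
# materialized as a single in-memory list.
total = 0
largest_block = 0
for out_block in stream_pipeline(num_rows=10, block_size=4):
    total += sum(out_block)
    largest_block = max(largest_block, len(out_block))
```

Because each stage is a generator, peak memory in this sketch depends on `block_size` rather than on the total number of rows. Ray Data applies the same principle at cluster scale.
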
Figure: Example batch inference pipeline with Ray Data over a large dataset using a heterogeneous cluster of CPUs and GPUs.
