6. When to use Ray Data


Ray Data performs especially well when you need to:

process data in a streaming fashion
process a large dataset
run on a heterogeneous cluster of CPUs and GPUs

One common use case is batch inference with Ray Data over a large dataset.
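As a rough illustration of batch inference, here is a minimal sketch. ToyModel and run_batch_inference are hypothetical names introduced for this example, and the "model" only multiplies the id column by a constant. The Ray calls used (ray.data.range, Dataset.map_batches with a callable class and the concurrency argument, Dataset.take) assume a recent Ray 2.x release.

```python
class ToyModel:
    """Hypothetical stand-in for a real model. Ray Data creates one
    instance per actor, so expensive setup runs once per worker."""

    def __init__(self):
        # A real pipeline would load model weights here (assumption).
        self.weight = 2.0

    def __call__(self, batch: dict) -> dict:
        # With batch_format="numpy", batch maps column names to NumPy arrays.
        batch["prediction"] = batch["id"] * self.weight
        return batch


def run_batch_inference(num_rows: int = 1000) -> list:
    # Imported lazily so ToyModel can be exercised without a Ray cluster.
    import ray

    # Synthetic dataset with a single "id" column; a real job would use a
    # reader such as ray.data.read_parquet instead.
    ds = ray.data.range(num_rows)

    predictions = ds.map_batches(
        ToyModel,
        concurrency=2,         # size of the actor pool running ToyModel
        batch_size=128,
        batch_format="numpy",
        # num_gpus=1,          # uncomment to reserve one GPU per actor
    )
    # Execution is lazy; take() triggers the pipeline and returns rows as dicts.
    return predictions.take(5)
```

Calling run_batch_inference() on a machine with Ray installed starts a local cluster if none is running. Reading and preprocessing run as CPU tasks while the model runs in its own actor pool, which is how the pipeline spreads over a mixed CPU and GPU cluster.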

Ray Data also integrates with Ray Train, which makes it a good choice for data preprocessing in machine learning training pipelines, especially when you need to:

Independently scale out data loading and transformation from model training.
Enable fault tolerance for model training.
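The two points above can be sketched as follows. This is an illustrative outline rather than a complete training job: rows_in_batch, train_loop_per_worker, and build_trainer are hypothetical names, the loop only counts rows instead of training a model, and the worker count and max_failures value are arbitrary. It assumes a recent Ray 2.x release with PyTorch installed.

```python
def rows_in_batch(batch: dict) -> int:
    """Return the number of rows in a column-oriented batch."""
    first_column = next(iter(batch.values()), [])
    return len(first_column)


def train_loop_per_worker(config: dict) -> None:
    # Imported lazily: only needed when running on Ray Train workers.
    from ray import train

    # Each training worker receives its own shard of the "train" dataset.
    shard = train.get_dataset_shard("train")
    for epoch in range(config["epochs"]):
        rows_seen = 0
        for batch in shard.iter_batches(batch_size=config["batch_size"]):
            # A real loop would run a forward and backward pass here (assumption).
            rows_seen += rows_in_batch(batch)
        train.report({"epoch": epoch, "rows_seen": rows_seen})


def build_trainer(num_workers: int = 2):
    import ray
    from ray.train import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    # Data loading and transformation are handled by Ray Data and scale
    # separately from the training workers configured below.
    train_ds = ray.data.range(1000)

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 2, "batch_size": 64},
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=False),
        # Retry the run up to 3 times if a worker fails.
        run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
    )
```

Calling build_trainer().fit() would launch the run. The dataset is passed through the datasets argument rather than loaded inside the training loop, so the data pipeline can use different resources from the training workers.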

7. Ray Data in Production

Runway AI uses Ray Data to scale its ML workloads. See this interview with Runway AI to learn more.
Netflix uses Ray Data for multi-modal batch inference pipelines. See this talk at Ray Summit 2024 to learn more.
Spotify uses Ray Data for large-scale data processing. See this talk at Ray Summit 2023 to learn more.

8. Upcoming Features in Ray Data

Here are some relevant upcoming features in Ray Data:

For structured data:

improved groupby and map_groups performance
using Parquet metadata to compute statistics such as count
enabling predicate pushdown for Parquet files when calling filter
supporting join and merge operations
optimizing performance of the Preprocessor API for distributed feature engineering
running Spark on Ray more seamlessly

For all data types:

data checkpointing for fault tolerance
optimizing data connectors
concurrent execution of multiple datasets

# Run this cell for file cleanup 
!rm {storage_folder}/adjusted_data.parquet
!rm -rf {storage_folder}/adjusted_data_ray/
