6. When to use Ray Data

Ray Data performs especially well when you need to:

  • run data processing in a streaming fashion

  • run across a large dataset

  • run inside a heterogeneous cluster of CPUs and GPUs

One such use case is batch inference with Ray Data over a large dataset.
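To make the streaming idea concrete, here is a minimal plain-Python sketch of batch inference in a streaming fashion. It is not Ray Data code and does not reflect how Ray Data is implemented; `read_batches`, `toy_model`, and `stream_map` are hypothetical names introduced only for illustration. In Ray Data, the per-batch step would typically be expressed with `Dataset.map_batches`.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[float]


def read_batches(num_rows: int, batch_size: int) -> Iterator[Batch]:
    """Yield batches lazily so the full dataset is never held in memory at once."""
    for start in range(0, num_rows, batch_size):
        stop = min(start + batch_size, num_rows)
        yield [float(i) for i in range(start, stop)]


def toy_model(batch: Batch) -> Batch:
    """Hypothetical stand-in for a model's forward pass: doubles each value."""
    return [2.0 * x for x in batch]


def stream_map(batches: Iterable[Batch], fn: Callable[[Batch], Batch]) -> Iterator[Batch]:
    """Apply fn to each batch as it arrives instead of materializing all inputs first."""
    for batch in batches:
        yield fn(batch)


def run_inference(num_rows: int, batch_size: int) -> List[float]:
    """Pull batches through the read -> predict pipeline one at a time."""
    predictions: List[float] = []
    for output in stream_map(read_batches(num_rows, batch_size), toy_model):
        predictions.extend(output)
    return predictions


predictions = run_inference(num_rows=10, batch_size=4)
print(predictions)
```

Because both stages are generators, only one batch is in flight at a time here. A distributed engine can additionally run the stages in parallel and place them on different resources (for example, reading on CPUs and inference on GPUs).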

Ray Data also integrates with Ray Train, which makes it a good fit for data preprocessing in machine learning training pipelines, especially when you need to:

  • Independently scale out data loading and transformation from model training.

  • Enable fault tolerance for model training.

7. Ray Data in Production

  1. Runway AI uses Ray Data to scale its ML workloads. See this interview with Runway AI to learn more.

  2. Netflix uses Ray Data for multi-modal batch inference pipelines. See this talk at Ray Summit 2024 to learn more.

  3. Spotify uses Ray Data for large-scale data processing. See this talk at Ray Summit 2023 to learn more.

8. Upcoming Features in Ray Data

Here are some relevant upcoming features in Ray Data:

For structured data:

  • improved groupby and map_groups performance

  • using Parquet metadata for computing statistics like count

  • enabling predicate pushdown for Parquet files when calling filter

  • supporting join and merge operations

  • optimizing performance of the Preprocessor API for distributed feature engineering

  • running Spark on Ray more seamlessly
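Two of the items above (computing count from Parquet metadata and predicate pushdown on filter) rely on the same idea: a Parquet file records a row count and per-column min/max statistics for each row group, so some questions can be answered, and some row groups skipped, without reading the data itself. The sketch below illustrates that idea in plain Python. It is not Ray Data's or any Parquet reader's implementation; `RowGroup`, `make_group`, `count_from_metadata`, and `filter_greater_than` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group plus its metadata."""
    num_rows: int      # row count recorded in metadata
    min_value: int     # column minimum recorded in metadata
    max_value: int     # column maximum recorded in metadata
    values: List[int]  # the column data itself (the expensive part to read)


def make_group(values: List[int]) -> RowGroup:
    """Build a row group and derive its statistics, as a writer would."""
    return RowGroup(len(values), min(values), max(values), values)


def count_from_metadata(groups: List[RowGroup]) -> int:
    """Answer count() from metadata alone, never touching `values`."""
    return sum(group.num_rows for group in groups)


def filter_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and how many row groups had to be scanned."""
    matches: List[int] = []
    scanned = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # pushdown: no value in this group can exceed the threshold
        scanned += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, scanned


groups = [make_group([1, 2, 3]), make_group([4, 5, 6]), make_group([7, 8, 9])]
total = count_from_metadata(groups)
matches, scanned = filter_greater_than(groups, threshold=6)
print(total, matches, scanned)
```

In this toy dataset the filter scans only one of the three row groups, which is the kind of saving predicate pushdown aims for on sorted or clustered columns.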

For all data types:

  • data checkpointing for fault tolerance

  • optimizing data connectors

  • concurrent execution of multiple datasets

# Run this cell for file cleanup 
!rm {storage_folder}/adjusted_data.parquet
!rm -rf {storage_folder}/adjusted_data_ray/