6. When to use Ray Data
Ray Data performs especially well when you need to:
- run data processing in a streaming fashion
- process a large dataset
- run on a heterogeneous cluster of CPUs and GPUs
Here is one use case for Batch Inference with Ray Data over a large dataset:
Ray Data also integrates with Ray Train, which makes it a strong choice for data preprocessing in machine learning training pipelines, especially when you need to:
- scale data loading and transformation independently of model training
- enable fault tolerance for model training
7. Ray Data in Production
- Runway AI uses Ray Data to scale its ML workloads. See this interview with Runway AI to learn more.
- Netflix uses Ray Data for multi-modal batch inference pipelines. See this talk at the Ray Summit 2024 to learn more.
- Spotify uses Ray Data for large-scale data processing. See this talk at the Ray Summit 2023 to learn more.
8. Upcoming Features in Ray Data
Here are some relevant upcoming features in Ray Data:
For structured data:
- improved `groupby` and `map_groups` performance
- using Parquet metadata for computing statistics like `count`
- enabling predicate pushdown for Parquet files when calling `filter`
- supporting `join` and `merge` operations
- optimizing performance of the `Preprocessor` API for distributed feature engineering
- running Spark on Ray more seamlessly

For all data types:
- data checkpointing for fault tolerance
- optimizing data connectors
- concurrent execution of multiple datasets
# Run this cell for file cleanup
!rm {storage_folder}/adjusted_data.parquet
!rm -rf {storage_folder}/adjusted_data_ray/