Ray Data

Ray Data#

Ray Data not only makes it extremely easy to distribute workloads but also ensures that they with:

  • efficiency: minimize CPU/GPU idle time with heterogeneous resource scheduling.

  • scalability: streaming execution to petabyte-scale datasets, especially when working with LLMs

  • reliability by checkpointing processes, especially when running workloads on spot instances with on-demand fallback.

  • flexibility: connect to data from any source, apply transformations, and save to any format or location for your next workload.

🔥 RayTurbo Data has more functionality on top of Ray Data:

  • accelerated metadata fetching to improve reading from large datasets (start processes earlier).

  • optimized autoscaling where actor pools are scaled up faster, start jobs before entire cluster is ready, etc.

  • high reliability where entire fails jobs (even on spot instances), like head node, cluster, uncaptured exceptions, etc., can resume from checkpoints. OSS Ray can only recover from worker node failures.