09 · Build the DataLoader with prepare_data_loader()
Now let’s define a helper that builds the MNIST DataLoader and makes it Ray Train–ready.
Apply standard preprocessing:

- ToTensor() → convert PIL images to PyTorch tensors
- Normalize((0.5,), (0.5,)) → center and scale pixel values

Construct a PyTorch DataLoader with batching and shuffling. Finally, wrap it with prepare_data_loader(), which automatically:

- Moves each batch to the correct device (GPU or CPU).
- Copies data from host memory to device memory as needed.
- Injects a PyTorch DistributedSampler when running with multiple workers, so that each worker processes a unique shard of the dataset.
This utility lets you use the same DataLoader code whether you’re training on one GPU or many — Ray handles the distributed sharding and device placement for you.
# 09. Build a Ray Train–ready DataLoader for MNIST
def build_data_loader_ray_train(batch_size: int) -> torch.utils.data.DataLoader:
    # Define preprocessing: convert to tensor + normalize pixel values
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])

    # Load the MNIST training set from persistent cluster storage
    train_data = MNIST(
        root="/mnt/cluster_storage/data",
        train=True,
        download=True,
        transform=transform,
    )

    # Standard PyTorch DataLoader (batching, shuffling, drop last incomplete batch)
    train_loader = torch.utils.data.DataLoader(
        train_data, batch_size=batch_size, shuffle=True, drop_last=True
    )

    # prepare_data_loader():
    # - Adds a DistributedSampler when using multiple workers
    # - Moves batches to the correct device automatically
    train_loader = ray.train.torch.prepare_data_loader(train_loader)
    return train_loader
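
For context, here is a minimal sketch of how this helper might be consumed inside a per-worker training loop. The function name train_func, the config keys, and the loop body are illustrative placeholders rather than part of this tutorial's code; the set_epoch() call follows the usual pattern for the DistributedSampler that prepare_data_loader() injects.

```python
import ray.train

# Hypothetical per-worker training function (train_func and the config keys
# are placeholders, not defined elsewhere in this tutorial).
def train_func(config: dict):
    train_loader = build_data_loader_ray_train(batch_size=config["batch_size"])

    for epoch in range(config["num_epochs"]):
        # With multiple workers, prepare_data_loader() injected a DistributedSampler;
        # reshuffle its per-worker shards each epoch.
        if ray.train.get_context().get_world_size() > 1:
            train_loader.sampler.set_epoch(epoch)

        for images, labels in train_loader:
            # Batches already sit on the correct device; no manual .to(device) needed.
            ...
```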
Ray Data integration
This step isn’t necessary if you are integrating your Ray Train workload with Ray Data. That integration is especially useful when preprocessing is CPU-heavy and you want to run preprocessing and training on separate instances.
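
As a rough sketch of that alternative, the Ray Data path hands a Dataset to the trainer, which shards it across workers; each worker then iterates Torch batches instead of building a DataLoader. Everything below is illustrative: the Parquet path, the "image"/"label" column names, and the trainer configuration are assumptions, not part of this tutorial.

```python
import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Hypothetical source: MNIST materialized as Parquet with "image" and "label" columns.
train_ds = ray.data.read_parquet("/mnt/cluster_storage/data/mnist_parquet")

def train_func(config: dict):
    # Ray Train shards the Dataset across workers; no DataLoader or
    # prepare_data_loader() call is needed in this setup.
    shard = ray.train.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
            images, labels = batch["image"], batch["label"]  # assumed column names
            ...

trainer = TorchTrainer(
    train_func,
    train_loop_config={"batch_size": 128, "num_epochs": 2},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```

With this layout, preprocessing can run as Ray Data operations on CPU nodes while the GPU workers only consume ready batches, which is what makes the separate-instances setup described above possible.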