6. Materializing data

You can materialize the entire dataset into the Ray object store. The object store is distributed across the cluster; it keeps data primarily in memory and spills to disk when memory runs out.

To materialize the dataset, call its materialize() method.

Use this only when you require the full dataset to compute downstream outputs.

ds_preds.materialize()

Ray Data executes lazily, so calling materialize() triggers execution of the whole pipeline. The logs should show the dataset's execution plan:

Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)]
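The difference between building a plan and materializing it can be sketched with a toy model in plain Python. This is only an analogy, not Ray's implementation: the LazyPipeline class and its map(), plan(), and materialize() methods are hypothetical names chosen for illustration, and the normalize and classify functions are simple stand-ins for the real stages.

```python
from typing import Callable, Iterable, List, Optional


class LazyPipeline:
    """Toy stand-in for a lazily executed dataset (hypothetical, not Ray's API)."""

    def __init__(
        self,
        source: Iterable[int],
        stages: Optional[List[Callable[[int], int]]] = None,
    ) -> None:
        self._source = source
        self._stages = list(stages or [])

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Only record the stage; no data is processed here.
        return LazyPipeline(self._source, self._stages + [fn])

    def plan(self) -> str:
        # Describe the recorded stages, loosely mirroring an execution plan.
        names = ["Input"] + [f"Map({fn.__name__})" for fn in self._stages]
        return " -> ".join(names)

    def materialize(self) -> List[int]:
        # Run every stage over every item and hold all results in memory.
        results = []
        for item in self._source:
            for fn in self._stages:
                item = fn(item)
            results.append(item)
        return results


calls: List[int] = []  # records each input that normalize() actually sees


def normalize(x: int) -> int:
    calls.append(x)
    return x * 2


def classify(x: int) -> int:
    return x % 3


pipeline = LazyPipeline(range(4)).map(normalize).map(classify)
calls_before = len(calls)          # still 0: nothing has run yet
results = pipeline.materialize()   # execution happens here
print(pipeline.plan())
print(calls_before, len(calls), results)
```

In this sketch, chaining map() calls is cheap because it only extends the plan. The cost of running the stages and holding every result is paid when materialize() is called, which is why materializing is best reserved for cases that need the full dataset.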