3. Transforming Data#
Let’s create a simple function to generate features from the data. Here is how we would do so using pandas:
def adjust_total_amount(df: pd.DataFrame) -> pd.DataFrame:
    df["adjusted_total_amount"] = df["total_amount"] - df["tip_amount"]
    return df

df = adjust_total_amount(df)
We can take the same function and apply it to the Ray dataset using map_batches.
map_batches divides each block of the dataset into batches and applies the function to each batch in parallel.
ds_adjusted = ds.map_batches(adjust_total_amount, batch_format="pandas")
The default batch_format in Ray Data is numpy, meaning each batch is passed to the function as a dict of NumPy arrays. For optimal performance, avoid converting batches to pandas DataFrames unless necessary.
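For illustration only, here is a rough sketch of the same feature computation written against the default numpy batch format; the function name adjust_total_amount_np is ours, not part of the example above.

import numpy as np
from typing import Dict

def adjust_total_amount_np(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    # Each batch is a dict mapping column names to NumPy arrays.
    batch["adjusted_total_amount"] = batch["total_amount"] - batch["tip_amount"]
    return batch

ds_adjusted = ds.map_batches(adjust_total_amount_np)  # batch_format defaults to "numpy"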
For the sake of this example, let’s add another simple transformation that computes the tip percentage.
def compute_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    df["tip_percentage"] = df["tip_amount"] / df["total_amount"]
    return df

df = compute_tip_percentage(df)
We apply it again using map_batches. Note that we can also control additional parameters, such as the batch size.
ds_tip = ds_adjusted.map_batches(compute_tip_percentage, batch_format="pandas", batch_size=1024)
Execution mode#
Most transformations in Ray Data are lazy, meaning they don’t execute until you do one of the following (see the sketch after this list):
write a dataset to storage
explicitly materialize the data
iterate over the dataset (usually when feeding data to model training)
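As a rough sketch only (the output path and batch size below are placeholders), each of these actions triggers execution of the pending transformations:

# Writing to storage triggers execution (the path is a placeholder).
ds_tip.write_parquet("/tmp/adjusted_taxi_data")

# Explicitly materializing executes the pipeline and keeps the result in memory.
ds_materialized = ds_tip.materialize()

# Iterating over the dataset (e.g. when feeding a trainer) also triggers execution.
for batch in ds_tip.iter_batches(batch_size=1024):
    pass  # e.g. run a training step on the batch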
To explicitly materialize a very small subset of the data, you can use the take_batch method.
ds.take_batch()
Let’s view a batch of the transformed data.
ds_tip.take_batch()