3. Transforming Data


Let’s create a simple function that derives a new feature from the data. Here is how we would do it with pandas:

import pandas as pd

def adjust_total_amount(df: pd.DataFrame) -> pd.DataFrame:
    # Subtract the tip from the total to get the amount excluding the tip.
    df["adjusted_total_amount"] = df["total_amount"] - df["tip_amount"]
    return df

df = adjust_total_amount(df)

We can take the same function and apply it to the Ray dataset using map_batches.

map_batches groups the rows of each block of the dataset into batches and applies the function to those batches in parallel.

ds_adjusted = ds.map_batches(adjust_total_amount, batch_format="pandas")
Note

The default batch_format in Ray Data is numpy, which means that each batch is passed to the function as a dictionary mapping column names to NumPy arrays. For best performance, avoid converting the data to pandas DataFrames unless necessary.
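To make the note concrete, here is a minimal sketch of the same adjustment written against the numpy batch format. It assumes each batch arrives as a dictionary of column names to NumPy arrays; the function name adjust_total_amount_np and the sample values are made up for illustration, and the map_batches call is shown only as a comment.

```python
from typing import Dict

import numpy as np


def adjust_total_amount_np(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    # Each column is a NumPy array, so the subtraction is element-wise.
    batch["adjusted_total_amount"] = batch["total_amount"] - batch["tip_amount"]
    return batch


# Local check on a hand-made batch (illustrative values only).
sample_batch = {
    "total_amount": np.array([12.5, 20.0]),
    "tip_amount": np.array([2.5, 4.0]),
}
result = adjust_total_amount_np(sample_batch)

# Assumed usage with the Ray dataset from this section (no batch_format needed):
# ds_adjusted = ds.map_batches(adjust_total_amount_np)
```

Because the function never builds a DataFrame, it skips the pandas conversion that the note advises against.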

Let’s add another transformation. For the sake of this example, it is a simple one that computes the tip as a fraction of the total amount.

def compute_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    df["tip_percentage"] = df["tip_amount"] / df["total_amount"]
    return df

df = compute_tip_percentage(df)
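One thing to watch for: a row with a total_amount of zero makes the division produce inf or NaN. The following is one possible way to guard against that, assuming a missing value is an acceptable result for such rows. The name compute_tip_percentage_safe and the toy data are hypothetical and not part of the original example.

```python
import pandas as pd


def compute_tip_percentage_safe(df: pd.DataFrame) -> pd.DataFrame:
    # Treat a zero total as missing so the division yields NaN instead of inf.
    denominator = df["total_amount"].replace(0, float("nan"))
    df["tip_percentage"] = df["tip_amount"] / denominator
    return df


# Toy frame with one zero total (illustrative values only).
toy = pd.DataFrame({"total_amount": [10.0, 0.0], "tip_amount": [2.0, 1.0]})
safe = compute_tip_percentage_safe(toy)
```

Whether NaN, zero, or dropping the row is the right choice depends on how the feature is used downstream.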

We apply compute_tip_percentage with map_batches as before. Note that map_batches accepts additional parameters, such as batch_size, which sets the number of rows in each batch.

ds_tip = ds_adjusted.map_batches(compute_tip_percentage, batch_format="pandas", batch_size=1024)
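To build intuition for batch_size, the sketch below imitates batch-wise application with plain pandas. It is a conceptual illustration only, not how Ray Data is implemented: apply_in_batches is a hypothetical helper, it runs sequentially rather than in parallel, and the toy data is made up.

```python
import pandas as pd


def compute_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    # Repeated from this section so the block runs on its own.
    df["tip_percentage"] = df["tip_amount"] / df["total_amount"]
    return df


def apply_in_batches(df: pd.DataFrame, fn, batch_size: int) -> pd.DataFrame:
    # Hypothetical helper: slice rows into fixed-size batches,
    # transform each batch independently, then recombine the results.
    batches = [
        fn(df.iloc[start:start + batch_size].copy())
        for start in range(0, len(df), batch_size)
    ]
    return pd.concat(batches, ignore_index=True)


toy = pd.DataFrame(
    {"total_amount": [10.0, 20.0, 40.0], "tip_amount": [1.0, 5.0, 10.0]}
)
# Three rows with batch_size=2 give one batch of two rows and one of one row.
batched = apply_in_batches(toy, compute_tip_percentage, batch_size=2)
```

Since the function only looks at the rows it is given, the result does not depend on the batch size; the batch size mainly affects memory use and per-call overhead.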

Execution mode

Most transformations in Ray Data are lazy: they don’t execute until you do one of the following:

  • write a dataset to storage

  • explicitly materialize the data

  • iterate over the dataset (usually when feeding data to model training).
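The idea of lazy execution can be sketched with a toy class in plain Python. This is a stand-in for illustration only and does not reflect Ray Data's implementation: LazyPipeline, its map and materialize methods, and the sample row are all hypothetical.

```python
from typing import Callable, Iterable, List


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not Ray's implementation)."""

    def __init__(self, rows: Iterable[dict]):
        self._rows = list(rows)
        self._steps: List[Callable[[dict], dict]] = []

    def map(self, fn: Callable[[dict], dict]) -> "LazyPipeline":
        # Only record the step; nothing runs yet.
        self._steps.append(fn)
        return self

    def materialize(self) -> List[dict]:
        # Execution happens here, when results are actually requested.
        out = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            out.append(row)
        return out


calls = []


def add_adjusted(row: dict) -> dict:
    calls.append(row["total_amount"])  # track when the step really runs
    return {**row, "adjusted_total_amount": row["total_amount"] - row["tip_amount"]}


pipeline = LazyPipeline([{"total_amount": 12.0, "tip_amount": 2.0}]).map(add_adjusted)
calls_before = len(calls)      # the step has been recorded but not executed
rows = pipeline.materialize()  # the step executes now
calls_after = len(calls)
```

Registering the transformation is cheap; the work is deferred until a consuming operation asks for the results.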

To explicitly materialize a very small subset of the data, you can use the take_batch method.

ds.take_batch()

Let’s view a batch of the transformed data.

ds_tip.take_batch()