2. Loading Data

Our dataset is the New York City Taxi & Limousine Commission’s Trip Record Data.

Dataset features:

| Column          | Description                                                                            |
|-----------------|----------------------------------------------------------------------------------------|
| trip_distance   | Float representing the trip distance in miles.                                          |
| passenger_count | The number of passengers.                                                               |
| PULocationID    | TLC Taxi Zone in which the taximeter was engaged.                                       |
| DOLocationID    | TLC Taxi Zone in which the taximeter was disengaged.                                    |
| payment_type    | A numeric code signifying how the passenger paid for the trip.                          |
| tolls_amount    | Total amount of all tolls paid in the trip.                                             |
| tip_amount      | Tip amount. Automatically populated for credit card tips; cash tips are not included.   |
| total_amount    | The total amount charged to passengers. Does not include cash tips.                     |
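
For reference, the payment_type codes map to payment methods. A small lookup like the one below makes them readable; this mapping follows the TLC data dictionary, but verify it against the current version before relying on it.

# payment_type codes per the TLC data dictionary (verify against the
# current dictionary; included here for illustration).
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}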

# The subset of columns we will load from the parquet files.
COLUMNS = [
    "trip_distance",
    "passenger_count",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "tolls_amount",
    "tip_amount",
    "total_amount",
]

DATA_PATH = "s3://anyscale-public-materials/nyc-taxi-cab"

Let’s read the data for a single month. This can take up to 2 minutes to run.

import pandas as pd

df = pd.read_parquet(
    f"{DATA_PATH}/yellow_tripdata_2011-05.parquet",
    columns=COLUMNS,
)

df.head()

Let’s check how much memory the dataset is using.

# Total memory usage in MiB.
df.memory_usage(deep=True).sum() / 1024**2

Let’s check how many files there are in the dataset.

!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --human-readable | wc -l

Even though we read only a subset of the columns, a single file already consumes ~1 GB of memory. Scaling to the entire dataset (~155 files, roughly 155 GB) would quickly exhaust memory on a small node.

Let’s instead use a distributed data processing library like Ray Data to load the full dataset across a cluster.

import ray

ds = ray.data.read_parquet(
    DATA_PATH,
    columns=COLUMNS,
)

Ray Data provides equivalents for common pandas I/O functions such as read_csv, read_parquet, and read_json.

Refer to the Input/Output docs for a comprehensive list of read functions.
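
For example, reading CSV or JSON data looks just like the parquet call above (the bucket paths here are hypothetical):

# Hypothetical paths, shown only to illustrate the parallel API.
csv_ds = ray.data.read_csv("s3://my-bucket/data.csv")
json_ds = ray.data.read_json("s3://my-bucket/data.json")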

Dataset

Let’s view our dataset.

ds

Ray Data uses lazy execution by default: the data is not loaded into memory until it is needed. Only a small part of the dataset is read up front to infer the schema.
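
For instance, inspecting the schema is cheap, while materializing forces the whole pipeline to run (a minimal sketch, assuming a recent Ray 2.x API):

# Cheap: the schema is inferred from file metadata / a small sample.
print(ds.schema())

# Forces execution: all blocks are loaded into the object store.
materialized = ds.materialize()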

A Dataset specifies a sequence of transformations that will be applied to the data.
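
As a sketch, each call below records a transformation without executing it; map_batches is the real API, while the doubling function is purely illustrative:

def double_distance(batch):
    # Batches arrive as dicts of arrays; this function is illustrative only.
    batch["trip_distance"] = batch["trip_distance"] * 2
    return batch

# Recorded as part of the pipeline; nothing runs until the data is consumed.
transformed = ds.map_batches(double_distance)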

The data itself will be organized into blocks, where each block is a collection of rows.

The following figure visualizes a tabular dataset with three blocks, each holding 1,000 rows.

Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.
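
A minimal sketch of passing a dataset to a task (the consume task here is hypothetical):

@ray.remote
def consume(dataset):
    # The dataset handle travels by reference; blocks are not copied eagerly.
    return dataset.count()

num_rows = ray.get(consume.remote(ds))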