## 2. Loading Data
Our dataset is the New York City Taxi & Limousine Commission’s Trip Record Data.
Dataset features:

| Column | Description |
|---|---|
| `trip_distance` | Float representing trip distance in miles. |
| `passenger_count` | The number of passengers. |
| `PULocationID` | TLC Taxi Zone in which the taximeter was engaged. |
| `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged. |
| `payment_type` | A numeric code signifying how the passenger paid for the trip. |
| `tolls_amount` | Total amount of all tolls paid in the trip. |
| `tip_amount` | Tip amount. This field is automatically populated for credit card tips; cash tips are not included. |
| `total_amount` | The total amount charged to passengers. Does not include cash tips. |
```python
COLUMNS = [
    "trip_distance",
    "passenger_count",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "tolls_amount",
    "tip_amount",
    "total_amount",
]

DATA_PATH = "s3://anyscale-public-materials/nyc-taxi-cab"
```
Let’s read the data for a single month with pandas. This can take up to 2 minutes to run.
```python
import pandas as pd

df = pd.read_parquet(
    f"{DATA_PATH}/yellow_tripdata_2011-05.parquet",
    columns=COLUMNS,
)
df.head()
```
Let’s check how much memory the dataset is using.
```python
# Memory usage of the DataFrame in MiB.
df.memory_usage(deep=True).sum() / 1024**2
```
Let’s check how many files there are in the dataset.
```python
!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --human-readable | wc -l
```
We are only using a subset of the columns, yet we are already consuming ~1 GB of memory per file. This will quickly become a problem if we want to scale to the entire dataset (~155 files) while running on a small node.

Let’s instead use a distributed data processing library like Ray Data to load the full dataset in a distributed manner.
```python
import ray

ds = ray.data.read_parquet(
    DATA_PATH,
    columns=COLUMNS,
)
```
There are Ray Data equivalents for common pandas functions such as `read_csv`, `read_parquet`, and `read_json`.
Refer to the Input/Output docs for a comprehensive list of read functions.
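As a quick sketch of this pandas-like API (the file paths below are hypothetical placeholders, not real datasets):

```python
# Hypothetical paths, shown only to illustrate the pandas-like API.
ds_csv = ray.data.read_csv("s3://my-bucket/data.csv")          # cf. pd.read_csv
ds_json = ray.data.read_json("s3://my-bucket/data.json")       # cf. pd.read_json
ds_parquet = ray.data.read_parquet("s3://my-bucket/parquet/")  # cf. pd.read_parquet
```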
### Dataset
Let’s view our dataset:

```python
ds
```
Ray Data adopts lazy execution by default: the data is not loaded into memory until it is needed. Instead, only a small part of the dataset is read to infer the schema.
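A minimal sketch of this behavior (assuming a recent Ray release where `Dataset.materialize` is available):

```python
# Cheap: the schema comes from Parquet metadata / a small sample,
# so the full dataset is never read.
print(ds.schema())

# Expensive: materialize() executes the whole plan and pins every
# block in the Ray object store. Uncomment to force full execution.
# materialized_ds = ds.materialize()
```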
A Dataset specifies a sequence of transformations that will be applied to the data.
The data itself will be organized into blocks, where each block is a collection of rows.
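For instance, chaining a transformation only extends the Dataset’s plan; nothing runs until the data is consumed. Here is a sketch using the `map_batches` API with an illustrative doubling function (not part of the original material):

```python
def double_distance(batch):
    # Illustrative transformation: double the trip distance in each batch.
    batch["trip_distance"] = batch["trip_distance"] * 2
    return batch

# This only records the transformation in the Dataset's plan.
transformed = ds.map_batches(double_distance)

# Execution is triggered when results are consumed, e.g. printing a few rows.
transformed.show(3)
```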
The following figure visualizes a tabular dataset with three blocks, each block holding 1000 rows:
Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.
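As a small sketch of this flexibility (the `consume` task below is a hypothetical example, not part of the Ray API):

```python
@ray.remote
def consume(ds) -> int:
    # The Dataset arrives as lightweight block references; blocks are
    # only fetched when the task actually reads them.
    return ds.count()

num_rows = ray.get(consume.remote(ds))
print(num_rows)
```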