Data storage#
import os
import shutil

# Save to artifact storage.
embeddings_path = os.path.join("/mnt/cluster_storage", "doggos/embeddings")
if os.path.exists(embeddings_path):
    shutil.rmtree(embeddings_path)  # clean up any previous run
embeddings_ds.write_parquet(embeddings_path)
(autoscaler +2m12s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +2m17s) [autoscaler] [4xT4:48CPU-192GB] Attempting to add 1 node to the cluster (increasing from 0 to 1).
(autoscaler +2m17s) [autoscaler] [4xT4:48CPU-192GB|g4dn.12xlarge] [us-west-2a] [on-demand] Launched 1 instance.
(autoscaler +2m57s) [autoscaler] Cluster upscaled to {104 CPU, 8 GPU}.
(autoscaler +6m52s) [autoscaler] Downscaling node i-0b5c2c9a5a27cfba2 (node IP: 10.0.27.32) due to node idle termination.
(autoscaler +6m52s) [autoscaler] Cluster resized to {56 CPU, 4 GPU}.
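As a quick, optional sanity check, you can read the embeddings back from cluster storage with Ray Data's read_parquet. This is a minimal sketch (not part of the original steps) that assumes the write above completed and that Ray is already initialized in your session:

import ray

# Read the Parquet files back and inspect what was written.
embeddings_check = ray.data.read_parquet(embeddings_path)
print(embeddings_check.count())   # number of rows written
print(embeddings_check.schema())  # column names and types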
🗂️ Storage on Anyscale
You can always store the data in any storage bucket you like, but Anyscale offers a default storage bucket to make things easier. You also have plenty of other storage options, for example shared storage at the cluster, user, and cloud levels.
Note: ideally you would store these embeddings in a vector database for efficient search, filtering, indexing, etc., but for this tutorial you just store them to a shared file system.
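If you do stay on the shared file system, a brute-force similarity search over the saved Parquet files could look like the sketch below. This is only an illustration of what a vector database would replace with a proper index; it assumes each row has an "embedding" column holding a list of floats, which may differ from your actual schema:

# Minimal brute-force cosine-similarity search over the stored embeddings.
# Assumes an "embedding" column; adjust the column name to your schema.
import numpy as np
import pandas as pd

df = pd.read_parquet(embeddings_path)                 # reads the directory of Parquet files
emb = np.stack(df["embedding"].to_numpy())            # shape (N, D)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def top_k(query_embedding, k=5):
    """Return the k most similar rows by cosine similarity."""
    q = np.asarray(query_embedding, dtype=emb.dtype)
    q = q / np.linalg.norm(q)
    scores = emb @ q
    idx = np.argsort(-scores)[:k]
    return df.iloc[idx].assign(score=scores[idx])

A dedicated vector database performs this same lookup with approximate-nearest-neighbor indexes, so it scales far better than scanning every row as this sketch does.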