Runhouse has several abstractions to provide a simple interface for storing, recalling, and moving data between the user’s laptop, remote compute, cloud storage, and specialized storage (e.g. data warehouses).
The Folder, Table, and Blob APIs provide least-common-denominator APIs across providers, allowing users to easily specify the actions they want to take on the data without needed to dig into provider-specific APIs.
Optionally, login to Runhouse to sync credentials.
We also construct a Runhouse cluster object that we will use throughout the tutorial. We won’t go in depth about clusters in this tutorial, but you can refer to Getting Started for setup instructions, or the Compute API tutorial for a more in-depth walkthrough of clusters.
The Runhouse Folder API allows for creating references to folders, and syncing them between local, remote clusters, or file storage (S3, GS, Azure).
Let’s construct a sample dummy folder locally, that we’ll use to demonstrate.
To create a folder object, use the
rh.folder() factory function, and
.to() to send the folder to a remote cluster.
INFO | 2023-08-29 19:45:52.597164 | Copying folder from file:///Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_folder to: cpu-cluster, with path: sample_folder INFO | 2023-08-29 19:45:54.633598 | Running command on cpu-cluster: ls sample_folder
0.txt 1.txt 2.txt 3.txt 4.txt
[(0, '0.txtn1.txtn2.txtn3.txtn4.txtn', '')]
You can also send the folder to file storage, such as S3, GS, and Azure.
INFO | 2023-08-29 19:47:47.618511 | Copying folder from file:///Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_folder to: s3, with path: /runhouse-folder/a6f195296945409da432b2981f984ae7 INFO | 2023-08-29 19:47:47.721743 | Found credentials in shared credentials file: ~/.aws/credentials INFO | 2023-08-29 19:47:48.796181 | Found credentials in shared credentials file: ~/.aws/credentials
['0.txt', '1.txt', '2.txt', '3.txt', '4.txt']
Similarly, you can send folders from a cluster to file storage, cluster to cluster, or file storage to file storage. These are all done without bouncing the folder off local.
The Runhouse Table API allows for abstracting tabular data storage, and supports interfaces for HuggingFace, Dask, Pandas, Rapids, and Ray tables (more in progress!).
These can be synced and written down to local, remote clusters, or file storage (S3, GS, Azure).
Let’s step through an example using a Pandas table we upload to our s3 bucket using Runhouse.
INFO | 2023-08-29 19:55:29.834000 | Found credentials in shared credentials file: ~/.aws/credentials
id grade 1 a 2 b 3 b 4 a 5 a 6 e
To sync over and save the table to a remote cluster, or to local (“here”):
INFO | 2023-08-29 19:59:39.456856 | Copying folder from s3://runhouse-table/sample_table to: cpu-cluster, with path: ~/.cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2 INFO | 2023-08-29 19:59:39.458405 | Running command on cpu-cluster: aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://runhouse-table/sample_table ~/.cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2
download: s3://runhouse-table/sample_table/d68a64f755014c049b6e97b120db5d0f.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/d68a64f755014c049b6e97b120db5d0f.parquet download: s3://runhouse-table/sample_table/ebf7bbc1b22e4172b162b723b4b234f2.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/ebf7bbc1b22e4172b162b723b4b234f2.parquet download: s3://runhouse-table/sample_table/53d00aa5fa2148dd9f4d9836f7b6a9be.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/53d00aa5fa2148dd9f4d9836f7b6a9be.parquet download: s3://runhouse-table/sample_table/2d0aed0ba49d42509ae9124368a74323.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/2d0aed0ba49d42509ae9124368a74323.parquet download: s3://runhouse-table/sample_table/ea3841db70874ee7aade6ff1299325c5.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/ea3841db70874ee7aade6ff1299325c5.parquet download: s3://runhouse-table/sample_table/e7a7dce218054b6aa2b0853c12afe952.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/e7a7dce218054b6aa2b0853c12afe952.parquet
INFO | 2023-08-29 19:59:49.336813 | Copying folder from s3://runhouse-table/sample_table to: file, with path: /Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_table
To stream batches of the table, we reload the table object, but with an
.data field, using the
rh.table constructor and passing
in the name.
Note that you can’t directly do this with the original table object, as
.data field is the original
data passed in, and not
necessarily in an iterable format.
id grade 0 1 a 1 2 b id grade 0 3 b 1 4 a id grade 0 5 a 1 6 e
The Runhouse Blob API represents an entity for storing arbitrary data. Blobs are associated with a system (local, remote, or file storage), and can be written down or synced to systems.
INFO | 2023-08-29 20:57:10.570715 | Creating new file folder if it does not already exist in path: /Users/caroline/Documents/runhouse/runhouse
To get the contents from a blob, use
'[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]'
Now that you understand the basics, feel free to play around with more complicated scenarios! You can also check out our additional API and example usage tutorials on our docs site.
⠹ Terminating cpu-cluster