Data: Folders, Tables, & Blobs


Runhouse has several abstractions to provide a simple interface for storing, recalling, and moving data between the user’s laptop, remote compute, cloud storage, and specialized storage (e.g. data warehouses).

The Folder, Table, and Blob APIs provide least-common-denominator APIs across providers, allowing users to easily specify the actions they want to take on the data without needing to dig into provider-specific APIs.

Install Runhouse and Set Up Cluster

Optionally, log in to Runhouse to sync credentials.

We also construct a Runhouse cluster object that we will use throughout the tutorial. We won’t go in depth about clusters in this tutorial, but you can refer to Getting Started for setup instructions, or the Compute API tutorial for a more in-depth walkthrough of clusters.

Folders

The Runhouse Folder API lets you create references to folders and sync them between your local machine, remote clusters, and file storage (S3, GS, Azure).

Let’s create a small sample folder locally to use in the demonstrations that follow.
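
As a minimal sketch of this step, the snippet below builds a folder of five small text files named 0.txt through 4.txt, matching the listing shown in the output further down. The helper name `make_sample_folder` and the file contents are illustrative assumptions, and the folder is written to a temporary directory so the sketch cleans up after itself.

```python
import os
import tempfile


def make_sample_folder(parent_dir, folder_name="sample_folder", num_files=5):
    """Create `folder_name` under `parent_dir` with `num_files` small text files."""
    folder_path = os.path.join(parent_dir, folder_name)
    os.makedirs(folder_path, exist_ok=True)
    for i in range(num_files):
        # Files are named 0.txt, 1.txt, ...; the contents are placeholders.
        with open(os.path.join(folder_path, f"{i}.txt"), "w") as f:
            f.write(str(i))
    return folder_path


# A temporary directory keeps the sketch side-effect free; in a real run you
# would likely pass your working directory instead.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = make_sample_folder(tmp)
    created_files = sorted(os.listdir(sample_path))
    print(created_files)
```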

To create a folder object, use the rh.folder() factory function, and use .to() to send the folder to a remote cluster.
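
A hedged sketch of what such a call might look like is given below. Only `rh.folder()` and `.to()` are named in this tutorial; the keyword arguments `path` and `system`, the use of `cluster.run()` to list the remote directory, and the wrapper function `send_folder_to_cluster` are assumptions that may differ between Runhouse versions, so check the API reference for your installed release.

```python
def send_folder_to_cluster(local_path, cluster, remote_path="sample_folder"):
    """Copy a local folder to a cluster and list it there (illustrative only)."""
    # Imported inside the function so this sketch loads even without Runhouse.
    import runhouse as rh

    # Wrap the local directory in a Folder object (keyword name is assumed).
    local_folder = rh.folder(path=local_path)
    # Send it to the cluster (`system` and `path` keyword names are assumed).
    cluster_folder = local_folder.to(system=cluster, path=remote_path)
    # List the remote directory to confirm that the files arrived.
    listing = cluster.run([f"ls {remote_path}"])
    return cluster_folder, listing
```

Here `cluster` stands for the cluster object constructed in the setup step, and `local_path` for the path of the sample folder.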

INFO | 2023-08-29 19:45:52.597164 | Copying folder from file:///Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_folder to: cpu-cluster, with path: sample_folder
INFO | 2023-08-29 19:45:54.633598 | Running command on cpu-cluster: ls sample_folder
0.txt
1.txt
2.txt
3.txt
4.txt
[(0, '0.txt\n1.txt\n2.txt\n3.txt\n4.txt\n', '')]
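
The last line of the output is the value returned by running the `ls` command: a list with one tuple per command, which here appears to hold the return code, standard output, and standard error. Assuming that layout, a small pure-Python helper (the name `filenames_from_run_result` is hypothetical) can turn it into a list of file names:

```python
def filenames_from_run_result(run_result):
    """Extract file names from a list of (return_code, stdout, stderr) tuples."""
    names = []
    for return_code, stdout, stderr in run_result:
        if return_code != 0:
            # Surface the remote error instead of silently returning nothing.
            raise RuntimeError(f"command failed: {stderr}")
        # splitlines() drops the trailing newline that `ls` emits.
        names.extend(stdout.splitlines())
    return names


# The value printed above, written out as a Python literal.
result = [(0, "0.txt\n1.txt\n2.txt\n3.txt\n4.txt\n", "")]
files = filenames_from_run_result(result)
print(files)
```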

You can also send the folder to file storage, such as S3, GS, and Azure.

INFO | 2023-08-29 19:47:47.618511 | Copying folder from file:///Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_folder to: s3, with path: /runhouse-folder/a6f195296945409da432b2981f984ae7
INFO | 2023-08-29 19:47:47.721743 | Found credentials in shared credentials file: ~/.aws/credentials
INFO | 2023-08-29 19:47:48.796181 | Found credentials in shared credentials file: ~/.aws/credentials
['0.txt', '1.txt', '2.txt', '3.txt', '4.txt']

Similarly, you can send folders from a cluster to file storage, from cluster to cluster, or from file storage to file storage. None of these transfers routes the folder through your local machine.

Tables

The Runhouse Table API allows for abstracting tabular data storage, and supports interfaces for HuggingFace, Dask, Pandas, Rapids, and Ray tables (more in progress!).

These can be synced and written down to local, remote clusters, or file storage (S3, GS, Azure).

Let’s step through an example using a Pandas table that we upload to our S3 bucket using Runhouse.

INFO | 2023-08-29 19:55:29.834000 | Found credentials in shared credentials file: ~/.aws/credentials
 id grade
  1     a
  2     b
  3     b
  4     a
  5     a
  6     e

To sync over and save the table to a remote cluster, or to local (“here”):

INFO | 2023-08-29 19:59:39.456856 | Copying folder from s3://runhouse-table/sample_table to: cpu-cluster, with path: ~/.cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2
INFO | 2023-08-29 19:59:39.458405 | Running command on cpu-cluster: aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://runhouse-table/sample_table ~/.cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2
download: s3://runhouse-table/sample_table/d68a64f755014c049b6e97b120db5d0f.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/d68a64f755014c049b6e97b120db5d0f.parquet
download: s3://runhouse-table/sample_table/ebf7bbc1b22e4172b162b723b4b234f2.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/ebf7bbc1b22e4172b162b723b4b234f2.parquet
download: s3://runhouse-table/sample_table/53d00aa5fa2148dd9f4d9836f7b6a9be.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/53d00aa5fa2148dd9f4d9836f7b6a9be.parquet
download: s3://runhouse-table/sample_table/2d0aed0ba49d42509ae9124368a74323.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/2d0aed0ba49d42509ae9124368a74323.parquet
download: s3://runhouse-table/sample_table/ea3841db70874ee7aade6ff1299325c5.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/ea3841db70874ee7aade6ff1299325c5.parquet
download: s3://runhouse-table/sample_table/e7a7dce218054b6aa2b0853c12afe952.parquet to .cache/runhouse/82d19ef56425409fb92e5d4dfcd389e2/e7a7dce218054b6aa2b0853c12afe952.parquet
INFO | 2023-08-29 19:59:49.336813 | Copying folder from s3://runhouse-table/sample_table to: file, with path: /Users/caroline/Documents/runhouse/runhouse/docs/notebooks/basics/sample_table

To stream batches of the table, we reload the table object, but with an iterable .data field, using the rh.table constructor and passing in the name.

Note that you can’t directly do this with the original table object, as its .data field is the original data passed in, and not necessarily in an iterable format.

   id grade
0   1     a
1   2     b
   id grade
0   3     b
1   4     a
   id grade
0   5     a
1   6     e
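
The three printed batches are consecutive two-row slices of the six-row table. To make that batching behavior concrete, here is a minimal pure-Python sketch that reproduces the same grouping over plain tuples. It is not how Runhouse implements streaming, and the helper name `iter_batches` is an illustrative assumption.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield consecutive lists of at most `batch_size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the input is exhausted
        yield batch


# The (id, grade) rows of the sample table shown earlier.
rows = [(1, "a"), (2, "b"), (3, "b"), (4, "a"), (5, "a"), (6, "e")]
batches = list(iter_batches(rows, batch_size=2))
for batch in batches:
    print(batch)
```

Because the helper consumes an iterator lazily, only one batch needs to be held in memory at a time, which is the main reason to stream a large table rather than load it whole.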

Blobs

The Runhouse Blob API represents an entity for storing arbitrary data. Blobs are associated with a system (local, remote, or file storage), and can be written down or synced to systems.

INFO | 2023-08-29 20:57:10.570715 | Creating new file folder if it does not already exist in path: /Users/caroline/Documents/runhouse/runhouse

To get the contents from a blob, use .fetch():

'[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]'
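
Note that the fetched value is a string rather than a list, which suggests the data was serialized to JSON text before being stored. Assuming that is the case (the cell that created the blob is not shown here), the payload can be built and decoded with the standard `json` module:

```python
import json

# Assumed payload: the integers 0 through 49 encoded as JSON text.
blob_payload = json.dumps(list(range(50)))

# Decoding the fetched string recovers the original list of integers.
decoded = json.loads(blob_payload)
print(type(blob_payload).__name__, type(decoded).__name__, len(decoded))
```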

Now that you understand the basics, feel free to play around with more complicated scenarios! You can also check out our additional API and example usage tutorials on our docs site.

Cluster Termination

 Terminating cpu-cluster