Getting past the AI hackathon cold start problem — Remote cluster auto-setup using Runhouse
Reproducible, cloud-agnostic GPU setup and remote code deployment using Runhouse: a walkthrough of Hugging Face’s Keras Dreambooth Event
AI hackathons are a great way to get involved with the AI community and try out new technologies and tools for custom use cases; oftentimes, these events even partner with cloud providers to give participants free compute! However, as eager as we all are to dive straight into the ML code and start training models and producing results, there’s always a messy, time-consuming hurdle to get past first: getting our remote dev environment set up.
Just last month, I participated in a Hugging Face community event for fine-tuning Stable Diffusion (a text-to-image model) using Keras DreamBooth, where some free LambdaLabs compute was supplied. While we were provided with very comprehensive starter code for actually running DreamBooth, there was quite a long list of instructions for getting set up with the remote LambdaLabs compute before we could run any code! We needed to become familiar with a cloud-provider-specific UI (which varies across hackathons) to launch and SSH into instances, manually install requirements and set environment variables, and shuttle data and code between local and remote hardware.
Runhouse facilitates this process by providing a local, cloud-agnostic Python interface for launching and setting up remote hardware and data, in a dependable and reproducible way. Runhouse helps to bridge the gap between local and remote code development in ML so that you never need to manually SSH into a cluster, or type the same setup code twice.
More concretely, Runhouse lets you:
- Automate reproducible hardware setup (cloud or on-prem instances), all in local Python code, without ever needing to launch an instance through a cloud UI or SSH into it
- Seamlessly sync local data and code to/from remote clusters or storage. For example, access local training images on a remote cluster for preprocessing.
- Run locally defined functions on remote clusters. Plus, if you have a Runhouse account (free!), save and reuse these functions across local and remote environments without needing to redefine them.
To see how Runhouse achieves this, let’s jump into some sample code from the Hugging Face Keras DreamBooth event, and see how in just a few extra lines of local code, Runhouse lets you consistently and automatically set up and deploy code to your remote compute environment. The full example notebook can be found here.
Runhouse Installation and Setup
To install Runhouse:
pip install runhouse
To sync cloud credentials to be able to launch AWS, GCP, Azure, or LambdaLabs clusters, run sky check and follow the instructions. No additional setup is necessary if using an on-prem or local cluster.
Runhouse lets you launch remote cloud instances and dependably configure environment dependencies and variables, all through local Python code. No more manually launching clusters from a cloud UI, or typing “erase data on instance” to terminate your instance.
To spin up an on-demand cluster and create a local reference to the cluster:
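A minimal sketch of this step, assuming the `rh.cluster(...)` factory and the `up_if_not()` method from the Runhouse release available around the time of the event (newer releases may name these differently). The cluster name, instance type, and provider values are illustrative placeholders, and `launch_cluster` is a hypothetical helper.

```python
# Illustrative settings only: the name, instance type, and provider below are
# assumptions, not values taken from the event instructions.
CLUSTER_SETTINGS = {
    "name": "rh-cluster",
    "instance_type": "A100:1",
    "provider": "lambda",
}


def launch_cluster(settings=None):
    """Return a local reference to an on-demand cluster, launching it if needed."""
    # Imported inside the function so this file can be loaded without Runhouse.
    import runhouse as rh

    # Assumed API: rh.cluster(...) builds the reference, and up_if_not() starts
    # the instance only when it is not already running.
    gpu = rh.cluster(**(settings or CLUSTER_SETTINGS))
    gpu.up_if_not()
    return gpu
```

Calling `gpu = launch_cluster()` would then give the `gpu` object used in the rest of the walkthrough.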
Now that the cluster is up and running, setting up the environment is as easy as gpu.run(list_of_cli_commands). Runhouse uses gRPC to run these commands on your cluster, so you don’t need to SSH into the machine yourself.
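As a rough illustration, the command list can be built in plain Python and handed to `gpu.run`. The helpers `build_setup_commands` and `setup_cluster` are hypothetical, and the package and directory names in the usage note are placeholders rather than the event's actual requirements.

```python
import shlex


def build_setup_commands(packages, data_dirs):
    """Return shell commands that install pip packages and create directories."""
    commands = []
    if packages:
        # One pip invocation for all packages; quoting guards against odd names.
        commands.append("pip install " + " ".join(shlex.quote(p) for p in packages))
    for directory in data_dirs:
        commands.append("mkdir -p " + shlex.quote(directory))
    return commands


def setup_cluster(gpu, packages, data_dirs):
    """Run the setup commands on an existing cluster object via gpu.run(list)."""
    return gpu.run(build_setup_commands(packages, data_dirs))
```

For example, `setup_cluster(gpu, ["tensorflow", "keras_cv"], ["dreambooth/instance"])` would send two commands to the cluster. Keeping the list in code means the same setup can be replayed on a fresh instance.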
Now, to use our local dataset of images for fine-tuning on the cluster, we can use Runhouse to shuttle these images over to the remote cluster as a folder.
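A sketch of the folder sync, assuming the `rh.folder(path=...).to(system=..., path=...)` call pattern from the Runhouse release used at the time. The `dreambooth` remote root and both helper names are hypothetical choices for illustration.

```python
from pathlib import Path, PurePosixPath


def remote_path_for(local_dir, remote_root="dreambooth"):
    """Mirror a local folder's name under a remote root (hypothetical layout)."""
    return str(PurePosixPath(remote_root) / Path(local_dir).name)


def send_folder(local_dir, gpu, remote_root="dreambooth"):
    """Copy a local folder to the cluster and return the remote path used."""
    import runhouse as rh  # lazy import: only needed when actually syncing

    remote_path = remote_path_for(local_dir, remote_root)
    # Assumed API: rh.folder(path=...).to(system=..., path=...) syncs the folder.
    rh.folder(path=local_dir).to(system=gpu, path=remote_path)
    return remote_path
```

Returning the remote path makes it easy to pass the same location to functions that later run on the cluster.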
Now that the images are on the cluster, we’d like to assemble them on the cluster as well. To do so, take our local function assemble_dataset, wrap it in a Runhouse function, and send it to our GPU. The new function is called just as we would the original function, but the magic is that it runs on our remote hardware, rather than our local environment.
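The wrapping step might look like the sketch below. It assumes the `rh.function(fn=..., system=..., reqs=[...])` signature from the Runhouse release used at the time. The real `assemble_dataset` lives in the event notebook, so `pair_images_with_prompt` is only a simplified stand-in, and `send_to_cluster` is a hypothetical helper.

```python
def pair_images_with_prompt(image_paths, prompt):
    """Stand-in for assemble_dataset: attach one text prompt to every image path."""
    return [(path, prompt) for path in sorted(image_paths)]


def send_to_cluster(fn, gpu, reqs=None):
    """Wrap a local function so that calling the wrapper executes on the cluster."""
    import runhouse as rh  # lazy import so the helpers above work without Runhouse

    # Assumed API: rh.function(fn=..., system=..., reqs=[...]) returns a callable
    # with the same arguments as fn; reqs lists packages to install remotely.
    return rh.function(fn=fn, system=gpu, reqs=reqs or [])
```

With a running cluster, `remote_pair = send_to_cluster(pair_images_with_prompt, gpu)` followed by `remote_pair(paths, prompt)` would be called exactly like the local version.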
In addition to folders, Runhouse also lets you sync tables and blobs between file storage (like S3) and local or remote hardware.
Similarly, we move our Dreambooth training code into a train_dreambooth function, so that we can wrap it as a Runhouse function to send to our cluster. As before, we use the Runhouse function as if it were a local function, but all the training happens on our remote GPU.
Below, we define an inference function run_inference, and show how we can perform inference remotely as well, while returning the results back to the local side for us to visualize.
While the inference is run remotely, the resulting inference_images is a local variable which we can use to view the results locally! (Alternatively, if you prefer to leave the results on your cluster, you can use .remote() to get an object ref to the images on the cluster.)
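The two calling styles can be contrasted in a small helper. This is a sketch under the assumption that a Runhouse-wrapped function is callable directly and also exposes `.remote(...)`, as described above; `run_remote_inference` and its `keep_on_cluster` flag are hypothetical names.

```python
def run_remote_inference(remote_fn, prompt, keep_on_cluster=False):
    """Call a Runhouse-wrapped function and choose where the result lives."""
    if keep_on_cluster:
        # Assumed API: .remote() returns a reference to the result on the cluster
        # instead of copying the result back.
        return remote_fn.remote(prompt)
    # A plain call runs on the cluster and returns the result to the local process.
    return remote_fn(prompt)
```

Leaving results on the cluster can avoid a large transfer when the images are only needed by a later remote step.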
Terminating the Instance
To terminate the instance once you’re done, simply run gpu.teardown() in Python, or sky down gpu in the CLI.
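One way to avoid leaving a paid instance running after a failed script is to wrap the work in a context manager that always calls `gpu.teardown()`. This is an optional pattern rather than something Runhouse requires, and `auto_teardown` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def auto_teardown(gpu):
    """Yield the cluster and always tear it down afterwards, even on errors."""
    try:
        yield gpu
    finally:
        # Runs whether the body finished normally or raised an exception.
        gpu.teardown()
```

Used as `with auto_teardown(gpu) as cluster: ...`, the instance is terminated when the block exits, and any exception from the body still propagates.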
Resources & More Information
The full tutorial notebook running the Keras Dreambooth code from the event can be found here. For more examples of using Runhouse for Hugging Face autosetup, check out our HF Transformers and Accelerate integrations.