A cartoon woman with gray hair and a wizard robe and hat jumping over a hurdle

Image generated with Stable Diffusion

The Quickest AWS SageMaker Deployment in Existence

Deploy and Run Your Code on SageMaker in 10 Minutes with Runhouse

Josh Lewittes

CTO @ 🏃‍♀️Runhouse🏠

Matt Kandler

Engineer @ 🏃‍♀️Runhouse🏠

December 12, 2023

Why AWS SageMaker? Why now?
SageMaker Onboarding in 10 Minutes with rh.SageMakerCluster
Simple SageMaker Inference Example
Advantages of Using Runhouse with SageMaker
Making ML Infra Fast and Homey

We’re excited to announce AWS SageMaker support in Runhouse, which aims to unlock SageMaker’s unique infrastructure advantages without the typical onboarding lift and to improve ergonomics for existing SageMaker users. As usual, the essence of our approach is giving you accessibility and debuggability without the onramp or restrictions of conforming your code to infra-specific APIs. We’ll dive into detailed use cases and code samples, but first, let’s discuss who this is for and why we’ve made it such a high priority.

The examples referenced in this post are publicly available in this Github repo.

Why AWS SageMaker? Why now?

SageMaker has a complex history, and frankly wasn’t a compute platform we expected to prioritize adding to Runhouse this early. But we’ve found that some little-known value drivers in SageMaker make it a compelling option for lean ML teams, and saw an opportunity to significantly improve the experience for existing SageMaker users.

Rather than a single tool, SageMaker is an out-of-the-box “Machine Learning Platform” offered by AWS, reminiscent of the ML Platforms publicized by Meta, Uber, Spotify, and others circa 2018. It’s often picked up as a default option for new ML teams bootstrapping their infra or enterprise teams aligning on a centralized platform. But in 2023, with a slew of competing options, it’s unclear to most teams whether they should be using SageMaker, despite a wealth of guides and blog posts about it. It’s a complex suite of products, and considering its reputation for having a ~6 month onramp (which is now optional, stay tuned!), you’ll want to look before you leap. We’ve also found that many startups, even those with ML infrastructure experts, aren’t aware of some killer features of SageMaker which could dramatically improve their stack.

SageMaker’s more well-known high-level competitive value drivers (i.e. not diving into each of its subcomponents) lean organizational:

Centralization - Time-tested offerings for nearly every piece of the ML platform within AWS. This is pretty simple. If you have credits, discounts, strict vendor constraints, or just aren’t interested in entertaining many options for each piece of the stack, here’s everything in one place.
Admin controls and defaults - Enterprise teams can finely control how expensive resources are used, with usage limits and auto-stop by default in most places. With SageMaker, teams are far less likely to leave a GPU up for months by accident, and you can terminate an accidentally long-running GPU notebook more confidently than a random instance in EC2.

However, there is also unique value from SageMaker which is infrastructural, and far less well known.

Scalable semi-serverless orchestrated compute - SageMaker’s compute model can be seen almost like a big shared Kubernetes cluster, but you get it without the management overhead or separate cost of a managed solution like EKS or ECS (though you still pay for SageMaker itself). You benefit from tapping into a large pool of compute so parallelism is nearly always available, rather than relying on a scheduler to queue your jobs, which adds latency, or an autoscaler to provision new instances, which adds failures and management overhead.

Suppose you have a training service which typically handles 1-2 jobs at once and it suddenly receives 4 requests. SageMaker would launch them in parallel without issue, whereas launching them one by one in EC2 or on a two-node Kubernetes cluster would not be fun. Like a container orchestrator, SageMaker can also launch jobs from Docker images rather than machine images, which is far easier and cheaper to manage.
GPUs are pooled in AWS SageMaker separately from EC2, and anecdotally, we’ve observed that they’re more available in SageMaker. This makes sense: the compute model of SageMaker is a large shared pool with ephemeral compute being released back into the pool constantly, whereas EC2 VMs tend to be longer lived.

These benefits are attractive, but come with complexity. SageMaker takes the approach of offering the complete cast of highly-specialized characters you’d find in ML platforms: a notebook service, a model registry, an orchestrator, an inference service, and much more (see this AI Infrastructure Alliance Landscape for a complete picture). All this with relatively prescriptive APIs reminiscent of what you’d find offered by an internal ML Platform team. These give the confidence of a system stress-tested at scale, but also make the onramp a 6-9 month ordeal of translating code and navigating complex behavior within the systems themselves.

SageMaker Onboarding in 10 Minutes with rh.SageMakerCluster

rh.SageMakerCluster is a Runhouse abstraction in front of SageMaker that, like other Runhouse compute abstractions, lets you dispatch arbitrary code or data to SageMaker compute through a simple and ergonomic API. This saves you the need to migrate or conform your existing code to the SageMaker APIs - a task that not only takes time, but also leads to code duplication and forking if you also use any other infrastructure. It’s open-source code that uses the SageMaker APIs locally with your own API keys and SageMaker setup, and it doesn’t require special permissions, enablements, or external vendors. If you’re already a SageMaker user, you can use SageMakerCluster immediately. It runs on top of SageMaker Training, primarily because this is the most flexible form of compute in SageMaker, but you can run inference, training, preprocessing, HPO, or any other arbitrary Python on the SageMakerCluster. Just send over rh.functions or rh.modules like you would to a static rh.cluster or on_demand_cluster.
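To make "arbitrary Python" concrete, here is a minimal sketch of sending a plain preprocessing function to SageMaker. It only uses the Runhouse calls that appear in the inference example in this post (rh.sagemaker_cluster, up_if_not, rh.function, .to). The tokenize function, the deploy_to_sagemaker helper, the cluster name "rh-sagemaker-cpu", and the choice of ml.m5.large are assumptions made for illustration, not part of Runhouse.

```python
def tokenize(texts):
    """Plain Python with no SageMaker-specific code: lowercase and split each string."""
    return [text.lower().split() for text in texts]


def deploy_to_sagemaker(fn, cluster_name="rh-sagemaker-cpu", instance_type="ml.m5.large"):
    """Hypothetical helper: send `fn` to a SageMaker cluster and return the remote callable.

    The cluster name and instance type are illustrative defaults, and the
    "sagemaker" profile is assumed to be configured as in the example in this post.
    """
    # Imported lazily so tokenize() stays usable on machines without Runhouse.
    import runhouse as rh

    cluster = rh.sagemaker_cluster(name=cluster_name,
                                   instance_type=instance_type,
                                   profile="sagemaker").up_if_not()
    return rh.function(fn).to(cluster)


# Local call: nothing SageMaker-specific is needed to run or test the function.
print(tokenize(["Hello SageMaker", "Runhouse"]))

# Remote call (requires AWS credentials and a working SageMaker setup):
# remote_tokenize = deploy_to_sagemaker(tokenize)
# remote_tokenize(["Hello SageMaker", "Runhouse"])
```

The function body is identical in both cases, so the same code can be unit tested locally and dispatched to SageMaker without a separate SageMaker-specific version.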

Simple SageMaker Inference Example

All the code below can be found in this Github repo containing some common use cases for SageMaker and examples implementing them with Runhouse. In this post, we'll look at a simple inference service, and explore more complex examples in subsequent posts.

Before you begin, run:

pip install "runhouse[sagemaker]"

and see our quick SageMaker Hardware Setup walkthrough to confirm your environment is set up properly.

Here’s all the code you need to deploy a simple Stable Diffusion inference service on a GPU in AWS SageMaker with Runhouse:

import runhouse as rh
from diffusers import StableDiffusionPipeline


def sd_generate_image(prompt):
    model = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to("cuda")
    return model(prompt).images[0]


if __name__ == "__main__":
    sm_gpu_cluster = rh.sagemaker_cluster(name="rh-sagemaker-gpu",
                                          instance_type="ml.g5.4xlarge",
                                          profile="sagemaker").up_if_not().save()

    # Create a Stable Diffusion microservice running on a SageMaker GPU
    sd_generate = rh.function(sd_generate_image).to(sm_gpu_cluster)

    # Call the service with a prompt, which will run remotely on the GPU
    img = sd_generate("A hot dog made out of matcha.")
    img.show()

Explore further: Source code

This code launches our desired SageMaker GPU instance, creates a Runhouse function object which wraps our existing code, and sends that function to the GPU as a microservice. Now whenever we call the local sd_generate function, it makes an HTTP call to the microservice running on the GPU, passing the prompt text and receiving a PIL image. Let’s walk through it line by line:

sm_gpu_cluster = rh.sagemaker_cluster(name="rh-sagemaker-gpu",
                                      instance_type="ml.g5.4xlarge",
                                      profile="sagemaker").up_if_not().save()

The first thing we do is create a cluster object, which represents a new SageMaker instance based on the instance type and other specs provided. Runhouse gives you the flexibility to configure the SageMaker compute according to your use case. In this case, since we are standing up an inference service and would like to keep it running indefinitely, we can use the default autostop_mins=-1. For the full list of configuration options, see the cluster factory documentation.

We then launch the compute (if it is not already running), create a new SSM session between your local machine and the SageMaker instance, and then create an SSH tunnel on top of the SSM session (see our documentation for more info on configuring SSM). Once the connection is made, you can SSH directly onto the instance (as easy as: ssh rh-sagemaker-gpu) and make requests to a lightweight HTTP server which Runhouse starts on the instance.
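As a rough illustration of tuning autostop_mins per use case, the sketch below builds the keyword arguments for rh.sagemaker_cluster in one place. The cluster_config and launch helpers and the "rh-sagemaker-dev" name are hypothetical. The 30-minute idle window is borrowed from the warm-cluster behavior described later in this post, and -1 is used for "keep running" as described above.

```python
def cluster_config(name, instance_type, keep_alive, idle_mins=30):
    """Hypothetical helper: build keyword arguments for rh.sagemaker_cluster.

    keep_alive=True uses autostop_mins=-1 (run indefinitely, as for the
    inference service); otherwise the cluster stops after `idle_mins` of inactivity.
    """
    return {
        "name": name,
        "instance_type": instance_type,
        "profile": "sagemaker",
        "autostop_mins": -1 if keep_alive else idle_mins,
    }


def launch(config):
    """Bring the cluster up if it is not already running (needs Runhouse and AWS access)."""
    import runhouse as rh  # lazy import keeps cluster_config() usable without Runhouse

    return rh.sagemaker_cluster(**config).up_if_not()


# A long-lived inference box and a short-lived dev box differ only in autostop_mins.
inference_cfg = cluster_config("rh-sagemaker-gpu", "ml.g5.4xlarge", keep_alive=True)
dev_cfg = cluster_config("rh-sagemaker-dev", "ml.g5.4xlarge", keep_alive=False)
print(inference_cfg["autostop_mins"], dev_cfg["autostop_mins"])

# sm_gpu_cluster = launch(inference_cfg)
```

Keeping the settings in a plain dictionary makes it easy to review or version them before anything is launched.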

# Create a Stable Diffusion microservice running on a SageMaker GPU
sd_generate = rh.function(sd_generate_image).to(sm_gpu_cluster)

We then create a Runhouse function (or microservice), which handles receiving the prompt, calling the model, and returning the output image. We call .to() on the function to deploy it to the SageMakerCluster.

# Call the service with a prompt, which will run remotely on the GPU
img = sd_generate("A hot dog made out of matcha.")
img.show()

After sending our function to SageMaker, we get back a Python callable object. This behaves exactly as we would expect it to if we were calling it locally, accepting the same inputs, producing the same outputs, and streaming stdout and logs back to be printed locally.
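To build intuition for how a local callable can stand in for a remote service, here is a toy, standard-library-only sketch of the pattern: a tiny HTTP server wraps a function, and a local stub forwards its arguments over HTTP and returns the result. This is not Runhouse's implementation. The names shout, FunctionHandler, remote_callable, and demo are made up for illustration, and JSON over a loopback port is an assumption chosen to keep the example small.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def shout(prompt):
    """Stand-in for the real workload (e.g. sd_generate_image)."""
    return prompt.upper()


class FunctionHandler(BaseHTTPRequestHandler):
    """Toy 'microservice': decodes JSON args, calls shout, returns JSON."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        args = json.loads(self.rfile.read(length))
        body = json.dumps(shout(*args)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for the demo


def remote_callable(host, port):
    """Return a local function that forwards its arguments to the server."""
    def call(*args):
        conn = http.client.HTTPConnection(host, port, timeout=10)
        try:
            conn.request("POST", "/", body=json.dumps(args).encode(),
                         headers={"Content-Type": "application/json"})
            return json.loads(conn.getresponse().read())
        finally:
            conn.close()
    return call


def demo():
    # Port 0 asks the OS for any free port on the loopback interface.
    server = HTTPServer(("127.0.0.1", 0), FunctionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        remote_shout = remote_callable("127.0.0.1", server.server_port)
        return remote_shout("a hot dog made out of matcha.")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

From the caller's side, remote_shout looks like an ordinary function, which is the property the Runhouse callable provides for code running on SageMaker.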

Advantages of Using Runhouse with SageMaker

If you’re considering standing up your own Kubernetes or ECS cluster for ML or you’d like to access SageMaker’s GPU availability, you should take 10 minutes to try SageMaker with Runhouse.

Access to SageMaker’s unique infra without migrating your existing code

Runhouse allows you to onboard to SageMaker in a matter of minutes, not months. You’ll retain the ability to run on any other compute (now or as your stack evolves) by leaving your code infra-agnostic, and you can interact with the compute from notebooks, IDEs, research code, pipeline DAGs, or any Python interpreter. Plus, you can adopt the complex world of SageMaker progressively over time as needed. We don’t hide the underlying SageMaker APIs if you want to reach deeper, such as using your own estimator.

Superuser SageMaker usage out of the box

We work directly with AWS to ensure that we’re delivering best-practice usage of the SageMaker APIs. If you already use SageMaker, Runhouse requires no additional permissions or setup. Any user with permission to run SageMaker Training can use Runhouse SageMakerCluster. We can support other SageMaker compute types too! Let us know if you need another kind.

Better debuggability

Normally it’s difficult to iterate and debug SageMaker code because you need to submit your jobs for execution and essentially wait to see logs on the other side. Runhouse lets you SSH into your SageMaker box, send arbitrary files or CLI commands up to the cluster, and stream stdout and logs back in real time, so you can debug interactively. By default, it keeps your cluster warm for 30 minutes of inactivity, instead of shutting down immediately.

Making ML Infra Fast and Homey

At Runhouse, we believe that your code should command your infra, and not vice versa. If this resonates with you, please drop us a Github star.

To explore Runhouse SageMakerCluster further:

Take a look at the SageMakerCluster API documentation
Try running the tutorial code: 🏃‍♀️Runhouse🏠 & SageMaker
Raise questions and feedback in our Discord or file a Github issue

If you’re interested in chatting about how to integrate rh.SageMakerCluster into your existing stack, book time with us. We may be able to offer competitive programs through AWS for you to test it.

In subsequent posts we’ll walk through more advanced use cases with SageMaker pipelines, including training and hyperparameter tuning, and how to set them up with Runhouse.