A Short, Conceptual Guide
Running your ML workload with Kubetorch requires three short steps, all in Python:
- Define the compute resources you want to use in Kubernetes, including any autoscaling or distribution
- Deploy your Python function or class to that compute as a Kubernetes service
- Call that remote service in local Python, just as you would the original function or class
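Putting those three steps together, a condensed sketch (combining the snippets shown in the sections below) might look like this:
```python
import kubetorch as kt

def train(epochs: int):
    ...  # your existing training code, unchanged

# 1. Define the compute: resources, image, and (optionally) distribution
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
).distribute("pytorch", workers=4)

# 2. Deploy the function to that compute as a Kubernetes service
train_ddp = kt.fn(train).to(gpus)

# 3. Call the remote service just like the original function
results = train_ddp(epochs=10)
```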
Standing up compute
A basic launch with GPUs and a container image might look something like this:
```python
import kubetorch as kt

gpus = kt.Compute(
    gpus=1,
    cpus=4,
    memory="12Gi",
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(["transformers"]),
).distribute("pytorch", workers=4)
```
At a high level, the compute request includes the resources and dependencies to run your Python code. Resources typically include GPUs, CPUs, memory, disk, etc. In the example above, we're agnostic to GPU type, but you can also specify `gpu_type` if the GPU product discovery plugin is installed on your cluster. Dependencies are generally specified as the image in which your application will run — a base image from your own Docker registry or a public image — and we let you lightly modify it for fast development without having to rebuild the image, such as by adding pip installs, setup commands, or environment variables.
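As a rough sketch of a more specific request (the `gpu_type` value and the extra pip install below are illustrative; the accepted GPU names depend on the labels your cluster exposes):
```python
import kubetorch as kt

# Sketch only: gpu_type requires the GPU product discovery plugin on your
# cluster, and "A100" is an illustrative value, not a guaranteed name.
gpus = kt.Compute(
    gpus=1,
    gpu_type="A100",
    cpus=4,
    memory="12Gi",
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(
        ["transformers", "datasets"]  # light modifications, no image rebuild
    ),
)
```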
The time to bring up compute can vary from a few seconds to a few minutes for a completely cold start. If there's compute available on the cluster that meets your resource requirements, start times can be as fast as loading your image. If not, the Kubernetes autoscaler will bring your compute online, adding a few minutes of latency on the first run. If you use a large image (e.g., PyTorch, TensorFlow, etc.), it might take a few minutes to pull on a cold start, but Kubernetes does cache your images on nodes. Note that these initial wait times are a one-time cost until your service is torn down; iterating reuses the same compute and does not require reprovisioning.
You can optionally add the `distribute()` method for distributed programs, or the `autoscale()` method for autoscaled services. For `distribute()`, this might be PyTorch, TensorFlow, Ray, Spark, Dask, etc., or simply a pool of parallel replicas of your service that can be called concurrently (e.g., for parallel offline batch embedding). We help wire up the pod-to-pod communication and environment setup; you just bring your distributed program written with Ray, PyTorch, etc. For autoscaling, you have options such as setting min/max replicas or choosing `concurrency` or `rps` as the metric.
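A hedged sketch of the autoscaling path is below; the keyword names for the replica bounds and the concurrency target are assumptions for illustration, since only the concepts (min/max replicas, concurrency or rps) are named above.
```python
import kubetorch as kt

# Illustrative sketch: the exact autoscale() keyword names may differ from
# these assumed ones (min_replicas, max_replicas, concurrency).
embedder_compute = kt.Compute(
    cpus=2,
    memory="8Gi",
    image=kt.Image(image_id="python:3.11").pip_install(["sentence-transformers"]),
).autoscale(
    min_replicas=1,    # assumed name for the lower replica bound
    max_replicas=8,    # assumed name for the upper replica bound
    concurrency=16,    # assumed name for a concurrency-based scaling target
)
```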
Finally, you can control automatic teardown of the service using `inactivity_ttl` in `Compute()`, or scale-to-zero behavior using `scale_to_zero_grace_period` in `.distribute()` or `.autoscale()`. This allows you to free up resources when idle, while keeping the service available during active development so you don't need to relaunch compute each time.
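A hedged sketch of both knobs (the duration values and string format are assumptions for illustration):
```python
import kubetorch as kt

# Sketch: inactivity_ttl and scale_to_zero_grace_period are the settings named
# above; the duration strings here are illustrative, not a documented format.
gpus = kt.Compute(
    gpus=1,
    inactivity_ttl="4h",  # tear the service down after ~4 idle hours
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
).autoscale(
    scale_to_zero_grace_period="30m",  # scale to zero after ~30 idle minutes
)
```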
Dispatching workloads
In the spirit of PyTorch, where workloads are sent to GPUs using `.to()`, Kubetorch lets you deploy any Python class or function to the compute you defined using `.to()`. This might be the entrypoint to your training loop, a single model to serve for inference, or a class that encapsulates a complex workload. The possibilities are as endless as Python itself. You don't need to adapt your function or class in any way, even if it uses global variables or imports code from elsewhere in your package or repo. We automatically sync your repo or package along with the function or class, and it will behave normally in the remote service.
```python
train_ddp = kt.fn(train).to(gpus)
# train_ddp_class = kt.cls(train_class).to(gpus)
```
The magic of Kubetorch is that after the first launch, subsequent `.to()` calls should take only 1–5 seconds to sync your local changes to the remote service. Compare that to rebuilding Docker images and rerunning a Kubeflow pipeline for a slight update, which can take 30–60 minutes. In production (e.g., recurring training scheduled by Airflow), you probably don't want to deploy new code into the image or install new dependencies. You can set `freeze=True` in `kt.Compute()` to ensure that the service relies only on the code baked into the container image, and `.to()` simply deploys the service (or does nothing if it's already up).
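A production-style sketch might look like this (the image tag is illustrative):
```python
import kubetorch as kt

# freeze=True: rely only on the code baked into the container image, so
# .to() just deploys the service (or no-ops if it's already up).
prod_compute = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="my-registry/train:v1.2.3"),  # illustrative image tag
    freeze=True,
)
train_prod = kt.fn(train).to(prod_compute)  # train is the same function as above
```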
Calling the remote workload
You can call the remote function or class as if it were local to trigger execution on the Kubernetes service.
```python
results = train_ddp(epochs=10)
# train_ddp_class.load_data("s3://my-bucket")
# train_ddp_class.train(epochs=10)
```
When you call the service, we make an HTTP call to the deployed service in Kubernetes and stream back logs in parallel. While this is a proper HTTP service, for security it is private to the cluster by default. When you're working from outside the cluster, we create a secure port-forward to a jump pod to proxy your request; from inside, we detect that no jump pod is needed and simply call the service's internal URL. Note that because these calls are made via HTTP, your parameters and return values must be JSON-serializable.
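For example, sticking to JSON-friendly arguments (the keyword arguments below are whatever your own function accepts):
```python
# Arguments and return values travel over HTTP, so keep them JSON-serializable.
results = train_ddp(epochs=10, config={"lr": 3e-4, "batch_size": 64})  # fine

# Passing live objects (e.g., an open file handle or a DataLoader) won't
# serialize; pass a path or a config dict and construct the object remotely.
```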
Exceptions raised inside your code, or arising during execution at the pod or cluster level (e.g., OOMs, preemptions), are propagated back to you as Kubetorch exceptions which you can `except` and act upon. Critically, exceptions do not cascade and force instant termination of the pods as they would in Kubeflow: your service remains in place (except, of course, when the error arose from pod preemption or failure), and you can even SSH into the pods with `kt ssh <service_name>` to live debug. The ability to handle errors in Python and the elimination of cascading failures mean your application can be orders of magnitude more fault-tolerant than most AI platforms allow. You can catch an OOM and choose to simply tear down the service and relaunch it with more memory. You can catch a GPU error during training and immediately make a call to checkpoint your model, losing no training progress. You can even catch a node preemption during training and simply continue training without that node!
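A hedged sketch of that pattern is below; the guide doesn't name the concrete exception classes, so a broad `except` stands in here, and the relaunch parameters are illustrative.
```python
import kubetorch as kt

try:
    results = train_ddp(epochs=10)
except Exception as err:  # placeholder: a Kubetorch exception, e.g. an OOM surfaced from the pod
    train_ddp.teardown()  # free the undersized compute
    bigger = kt.Compute(
        gpus=1,
        memory="24Gi",  # relaunch with more memory (illustrative value)
        image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
    ).distribute("pytorch", workers=4)
    train_ddp = kt.fn(train).to(bigger)
    results = train_ddp(epochs=10)
```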
Because the same Python code works from inside or outside the cluster, but still always runs on the same compute in Kubernetes, there's no "research-to-production" gap — you get identical, reproducible execution whether you're running this code locally, in CI, in a production job executor, or on someone else’s machine.
Teardown
In code, you can always tear down your service with `.teardown()`:
```python
train_ddp.teardown()
```
You can also use the CLI:
```
kt teardown <service_name>
```
We also let you set cluster-level configuration for teardown and autostop for inactivity. Once torn down, the resources will be reclaimed by Kubernetes, and nodes will be autoscaled down based on your cluster configuration.