Key Components
Kubetorch consists of the following key primitives, all defined in Python. They work together to define an ML workflow, which is a Knative service under the hood.
- Compute: resource requirements and environment specifications on which to run application code.
- Image: environment specifications to set up on the compute at start time, including pre-built Docker images and additional setup steps.
- Function/Class: a wrapper around your Python function or class, to be synced onto your compute. Once deployed, it returns a callable object that works just like the original function or class, except that it runs remotely instead of locally.
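Taken together, these primitives can be combined in a minimal end-to-end sketch. This is an illustrative example, not a prescribed workflow; the `kubetorch` import name behind the `kt` alias, the base image, and the function are assumptions for illustration:

```python
def preprocess(text: str) -> str:
    # Plain Python logic; after deployment, the same code runs remotely.
    return text.strip().lower()

if __name__ == "__main__":
    import kubetorch as kt  # assumed import name behind the kt alias

    # Compute + Image define where and how the code runs.
    compute = kt.Compute(
        cpus=1,
        image=kt.Image(image_id="python:3.11-slim").pip_install(["requests"]),
    )
    # kt.fn wraps the local function; .to() deploys it as a service.
    remote_preprocess = kt.fn(preprocess).to(compute)
    print(remote_preprocess("  Hello World  "))  # same result as calling locally
```

Each of these pieces is covered in more detail in the sections below.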
Compute
The kt.Compute class lets you specify the resource requirements and containerized environment in which to run your workflows. It also lets you control how the compute is managed and how it scales based on demand.
Configure the compute however you need, including:
- Resource requirements: Specify any resource requests that your infrastructure supports, such as the number of CPUs/GPUs, specific GPU types, memory, or disk size.
- Base image: Highly customizable base environment to use and set up at launch time, including pre-built images or further setup steps. This is explained in more detail in the Image section below.
- Distribution config: Specify the number and type of distributed workers for replicas, or set autoscaling parameters.
- Kubernetes flags: Set Kubernetes-specific options, such as namespace, labels/annotations, secrets, and more.
- Freeze setting: Whether to freeze the compute, preventing new updates and code syncs for production use cases.
Example:

```python
gpus = kt.Compute(
    gpus=1,
    cpus=4,
    memory="12Gi",
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(["transformers"]),
).distribute("pytorch", workers=4)
```
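For autoscaled rather than statically distributed compute, a configuration along these lines could apply. The `.autoscale` method name and its parameter names below are assumptions chosen to mirror Knative's autoscaling concepts, not confirmed API:

```python
# Hypothetical autoscaling configuration; method and parameter names are
# assumptions modeled on Knative autoscaling (min/max scale, concurrency).
service_compute = kt.Compute(
    cpus=2,
    memory="4Gi",
).autoscale(min_scale=0, max_scale=8, concurrency=16)
```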
Image
The kt.Image class lets you specify a pre-built base image to use at launch time, as well as additional setup steps required for your program, such as extra installs or environment variables.
These additional steps are as follows:
- pip_install: Run pip install for the given packages.
- set_env_vars: Set environment variables.
- sync_package: Sync over a locally installed package, and add it to the PATH.
- rsync: Rsync over local files or folders.
- run_bash: Run the specified commands.
The Image object is passed into the Compute object to define the containerized environment. When you launch your ML workflow, the pre-built image acts as the base image of the underlying Knative service, and the additional setup steps run before your application code. The image can be updated and propagated at any time for future deployments; we detect and run any differing setup steps on the order of seconds, without needing to recompile or rebuild an image.
Example:

```python
kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3") \
    .pip_install("transformers") \
    .set_env_vars({"HF_TOKEN": os.environ["HF_TOKEN"]}) \
    .rsync("/data_folder")
```
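The remaining setup steps follow the same chaining pattern. A sketch using the sync_package and run_bash steps listed above; the package name, commands, and exact argument shapes here are illustrative assumptions:

```python
# Illustrative only: sync a locally installed package onto the compute and
# run arbitrary setup commands. Names and argument shapes are assumptions.
kt.Image(image_id="python:3.11") \
    .sync_package("my_local_pkg") \
    .run_bash(["apt-get update", "apt-get install -y ffmpeg"])
```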
Function/Class
The kt.fn and kt.cls classes are wrappers around your locally defined Python function or class. Once wrapped, these objects can be sent .to(compute), which launches a Knative service (taking the compute requirements into account) and syncs over the files necessary to run the function remotely. Once a service is launched, you can continue to update your Python code locally and redeploy to your compute with .to, which re-syncs the updates in seconds and returns a new function/class that is immediately ready to use.
The returned object is a callable that functions just like the original Python method, only that it runs remotely on your infrastructure rather than locally. Because the same code runs locally and remotely, we ensure reproducibility and bridge the gap between research and production.
When you call the object, it makes an HTTP call to the deployed service. Log streaming, flexible error handling, and observability are all built into the system. Debugging is simple thanks to fast iteration loops and the ability to SSH into and directly interact with your compute.
Example:

```python
def sum(a: int, b: int):
    return a + b

if __name__ == "__main__":
    compute = kt.Compute(cpus=1)
    remote_sum = kt.fn(sum).to(compute)
    results = remote_sum(1, 3)
    print(results)  # prints 4
```
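After editing the function locally, redeploying is the same .to call described above. A sketch building on the previous example; the `kubetorch` import name behind the `kt` alias is an assumption:

```python
# After changing sum locally, e.g. adding a log line:
def sum(a: int, b: int):
    print(f"adding {a} + {b}")
    return a + b

if __name__ == "__main__":
    import kubetorch as kt  # assumed import name behind the kt alias

    compute = kt.Compute(cpus=1)
    # Re-running .to re-syncs only the code changes, in seconds,
    # and returns a fresh callable pointing at the updated service.
    remote_sum = kt.fn(sum).to(compute)
    print(remote_sum(2, 5))
```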