System Overview
Kubetorch is a system designed for fast, secure installation. End users and systems install the kubetorch Python library and use it to interact with remote compute. In Kubernetes, a simple Helm installation is all that is needed to unblock initial execution.
Python Client
For users, Kubetorch consists of the following key primitives, all defined in Python. These are described in more depth in Python Primitives.
- Compute: resource requirements and environment specifications on which to run application code.
- Image: environment specifications to set up on the compute at start time, including pre-built Docker images and additional setup steps.
- Function/Class: a wrapper around your Python function or class, to be synced onto your compute. Once deployed, it returns a callable object that behaves just like the original function or class, except that it executes remotely instead of locally.
You can string together the primitives to dispatch your function or class to your specified compute with the image setup:
```python
import kubetorch as kt

def sum(a: int, b: int):
    return a + b

if __name__ == "__main__":
    compute = kt.Compute(cpus=1)
    remote_sum = kt.fn(sum).to(compute)
    results = remote_sum(1, 3)
    print(results)  # prints 4
```
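The same pattern extends to classes and custom images. Below is a minimal sketch combining all three primitives; the specific `Image` options (`image_id`, `pip_install`), the `image` argument to `Compute`, and the `kt.cls` wrapper are illustrative assumptions based on the primitive descriptions above, not confirmed signatures.

```python
import kubetorch as kt

class Counter:
    """Plain Python class to be deployed and called remotely."""
    def __init__(self):
        self.count = 0

    def increment(self, step: int = 1) -> int:
        self.count += step
        return self.count

if __name__ == "__main__":
    # Hypothetical Image setup: start from a pre-built Docker image and
    # layer additional pip installs on top (names are assumptions).
    image = kt.Image(image_id="python:3.11").pip_install(["numpy"])
    compute = kt.Compute(cpus=1, image=image)

    # Deploy the class; the returned object proxies calls to the remote
    # instance, so method calls run on the cluster rather than locally.
    remote_counter = kt.cls(Counter).to(compute)
    print(remote_counter.increment(5))  # prints 5, executed remotely
```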
Helm Chart
Kubetorch is installed onto your Kubernetes cluster by you or a platform team owner via Helm. The basic install is extremely simple, and takes about 5 minutes to install resources into the kubetorch namespace.
While end users do not need to think about Kubernetes or write YAML, the platform remains fully transparent to platform owners and advanced users.
A number of optional installations unlock additional features; they are not enabled by default in order to accommodate existing clusters:
- Autoscaling: We typically recommend installing Knative to enable autoscaling (especially for inference use cases).
- Telemetry and Traces: If you do not already have Kubernetes logging and telemetry installed, Kubetorch comes with reasonable defaults that can deliver rich workload-level telemetry, system-level traces, and persisted logging.
- Queueing: Kubetorch recommends that users who want queueing install with the KAI Scheduler enabled. This is an open-source library maintained by NVIDIA that enables GPU sharing and gang scheduling in addition to standard workload prioritization and queueing.
- Filesystem: Kubetorch supports a number of systems to mount storage to pods, including JuiceFS and cloud provider native managed storage services.
- Ray: Through KubeRay, Kubetorch supports launching and using Ray clusters for distributed programs, alongside regular embarrassingly parallel workloads or PyTorch Distributed (see the sketch after this list).
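As one illustration of how these options surface in the Python client, the sketch below shows how a distributed job might be dispatched. The `gpus` argument and the `distribute` method (along with its `"pytorch"` and `workers` parameters) are hypothetical, shown only to convey the shape of the workflow.

```python
import kubetorch as kt

def train():
    # Placeholder training step; a real PyTorch Distributed job would
    # initialize a process group and run its training loop here.
    print("training step complete")

if __name__ == "__main__":
    # Hypothetical: request GPU compute and fan the function out across
    # four workers. distribute() and its arguments are assumptions, not
    # confirmed API.
    compute = kt.Compute(gpus=1).distribute("pytorch", workers=4)
    remote_train = kt.fn(train).to(compute)
    remote_train()
```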