Open Sourcing Kubetorch

Kubetorch is now open source to fill a critical gap in the ecosystem. Researchers and developers need a simple interface into Kubernetes to run complex, heterogeneous ML pipelines.

Donny Greenberg

CEO @ 🏃‍♀️Runhouse🏠

October 30, 2025

Two weeks ago, we announced Kubetorch, a powerful new way to build distributed ML applications at scale on Kubernetes. Due to overwhelming demand, we’ve chosen to open-source the core Kubetorch framework. “Overwhelming demand” means:

  • Enterprise and frontier lab alpha partners stipulating that the core framework needs to be open-source for them to migrate their training stack to it.
  • Influential open-source teams telling us they’d adopt it as a portable/multi-cloud infrastructure interface to build against if it were open-source, especially in their CI.
  • The product is called “Kubetorch.” Everyone assumes it’s open-source.

But why does the ecosystem need this?

Beyond "overwhelming demand" for free, high-quality software, there's a fundamental gap in the ML ecosystem that an open-source Kubetorch can fill.

Open-source ML frameworks lack a clear infrastructure interface that 1) exposes rich control over and visibility into the infrastructure, rather than some narrowly limited subset, and 2) works portably across cloud vendors, research institutions, and enterprises.

Kubernetes only recently emerged as the de facto standard for ML infrastructure. In the past, ML ran on a wide variety of platforms, from Slurm to SageMaker, so infrastructure agnosticism was practical for OSS framework developers. However, it left OSS ML frameworks operating strictly "above the infrastructure" rather than integrating with it, which creates several critical limitations:

  • Fault-tolerance - Current training frameworks have no visibility into infrastructure-level changes like node preemptions. They also rarely probe proactively for network or GPU failures, since they have no recourse if issues are found. Every such event is treated as an unrecoverable failure that terminates the program (see the sketch after this list).
  • Scalability - When applications can't probe available resources or create containerized sub-routines, they're essentially designed for single-machine usage. It's nearly always more efficient to scale subcomponents independently rather than naively scaling a full complex application horizontally (like LLM inference). Some frameworks adopt distributed runtimes like Ray or PyTorch Distributed to scale, but these weren't built for this purpose and don't expose the hardware and scaling controls available in Kubernetes. Ultimately, serious users build their own scaling directly in Kubernetes.
  • End-to-end testing - OSS libraries rarely test realistic deployment scenarios in CI.
  • Cost efficiency - Without infrastructure awareness, optimizing resource utilization is nearly impossible. Frameworks often massively over-provision or under-optimize; they cannot right-size resources, leverage spot instances, or share expensive GPU resources across experiments.
  • Heterogeneity - Users compose multiple tools in a pipeline for ML applications: Ray for data processing, PyTorch distributed for training, Triton for inference. But OSS frameworks don't share the same compute "canvas" as users, forcing stark all-in decisions on a single framework and shoehorning the entire application into it. They also can’t use containerization to isolate dependencies for different subsystems, so they shoehorn everything into one environment and burden users with days of SAT-solving version mismatches.
  • Composability/Modularity - Most frameworks see themselves as the end-to-end application, not as a subroutine. They launch complex sequences through massively branched CLI configs. Combining tools together almost always requires deep code surgery.
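To make the fault-tolerance gap concrete, here is a minimal sketch of the kind of infrastructure probing a training framework would need in order to notice a node preemption before it surfaces as an opaque collective-communication timeout. It uses the official kubernetes Python client (load_incluster_config, CoreV1Api, list_node); the polling loop and the handle_preemption hook are illustrative assumptions, not part of any existing framework.

```python
# Illustrative sketch only: what "infrastructure visibility" means in practice.
# Uses the official `kubernetes` Python client; the polling loop and the
# handle_preemption() hook are hypothetical, not from any existing framework.
import time
from kubernetes import client, config

def handle_preemption(node_name: str) -> None:
    # Hypothetical recovery hook: checkpoint and rescale instead of crashing.
    print(f"node {node_name} is unhealthy; checkpoint and rescale")

def watch_for_unhealthy_nodes(poll_seconds: int = 30) -> None:
    config.load_incluster_config()  # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    while True:
        for node in v1.list_node().items:
            ready = next(
                (c for c in node.status.conditions if c.type == "Ready"), None
            )
            if ready is None or ready.status != "True":
                # Without a hook like this, the training job only learns about
                # the lost node when a collective op times out and the whole
                # program is torn down.
                handle_preemption(node.metadata.name)
        time.sleep(poll_seconds)
```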

There are a few clear examples of these phenomena in the ecosystem:

  1. LLM-d - To oversimplify, this is “infrastructure-aware” vLLM: it essentially replaces vLLM’s use of Ray for composing LLM inference subsystems with independently scalable, Kubernetes-based ones.
  2. Code/Evaluation sandboxes - Launching containers within an RL workload is basically an unsolved problem in 2025. We see some frameworks insisting on manually launching dozens or hundreds of Docker containers inside the workload (i.e., Docker in Docker in Kubernetes) and others writing an unreadable mess of kubectl commands in Python (a rough sketch of that pattern follows this list).
  3. RL frameworks, like VeRL and TRL - Despite an already rich ecosystem of scalable training and inference components, most RL libraries rebuild these from scratch.
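To make example 2 concrete, this is roughly what the “kubectl commands in Python” pattern looks like: shelling out from inside the workload to create a sandbox pod, poll it, and exec code in it. This is a hedged illustration of the anti-pattern being described, not code from any particular RL framework; the pod name, image, and helper functions are placeholders.

```python
# A rough illustration of the "kubectl in Python" pattern criticized above,
# not code from any particular RL framework. Image and names are placeholders.
import json
import subprocess

def launch_sandbox(name: str, image: str = "python:3.11-slim") -> None:
    # Shell out to kubectl to create a one-off sandbox pod for code evaluation.
    subprocess.run(
        ["kubectl", "run", name, "--image", image, "--restart=Never",
         "--command", "--", "sleep", "infinity"],
        check=True,
    )

def wait_until_running(name: str) -> None:
    # Poll pod status by parsing kubectl's JSON output.
    while True:
        out = subprocess.run(
            ["kubectl", "get", "pod", name, "-o", "json"],
            check=True, capture_output=True,
        )
        if json.loads(out.stdout)["status"]["phase"] == "Running":
            return

def run_in_sandbox(name: str, code: str) -> str:
    # Exec the generated code inside the sandbox pod and capture its output.
    out = subprocess.run(
        ["kubectl", "exec", name, "--", "python", "-c", code],
        check=True, capture_output=True,
    )
    return out.stdout.decode()
```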

Kubetorch: A Programmatic Zero-Cost-Abstraction Interface for ML Infrastructure

Without rehashing our launch blog post, Kubetorch gives OSS ML developers an interface to integrate deeply with infrastructure, in a way that’s portable across the OSS ecosystem. Writing scalable, resource-efficient, heterogeneous, and fault-tolerant ML frameworks becomes easily accessible in Python, and composing together existing frameworks and libraries is natively supported. The abstractions are designed not to limit the control and integration richness of Kubernetes and its ecosystem, so users can tap into a world of existing scalable primitives. Kubetorch users can always reach for the most performant or scalable tool or system for their task without worrying about framework or cloud limitations.
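To give a flavor of what that interface looks like in practice, here is a minimal sketch in the style of Kubetorch’s public examples: declare compute requirements in Python, dispatch an ordinary function to them, and call it as if it were local. The specific names (kt.Compute, kt.fn, .to) follow the pattern from the launch examples but should be treated as illustrative rather than an authoritative API reference; the repo is the source of truth.

```python
# A minimal sketch in the style of Kubetorch's public examples; treat the exact
# names (kt.Compute, kt.fn, .to) as illustrative and check the repo for the
# authoritative API.
import kubetorch as kt

def train_epoch(lr: float) -> float:
    # Ordinary PyTorch code; nothing in here knows about Kubernetes.
    ...
    return 0.0  # e.g. a validation loss

if __name__ == "__main__":
    # Describe the compute in Python rather than YAML or CLI flags.
    gpu = kt.Compute(gpus=1)
    # Dispatch the local function to that compute and call it like a local one.
    remote_train = kt.fn(train_epoch).to(gpu)
    print(remote_train(lr=3e-4))
```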

With the release of Kubeflow’s Trainer v2, which includes a Python client for launching training, several other Kubernetes ML projects have stated the intention to include Python clients for programmatic usage. Kubetorch can support all of these CRDs with unified Python interfaces and behavior. Further, few of these projects will have the resources to support the fast iteration, debuggability, composability, and fault tolerance that Kubetorch provides out of the box for their Python interfaces.
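For contrast, this is roughly what each project’s standalone Python client amounts to without a shared layer: a thin wrapper that builds a CRD body and submits it through the kubernetes client’s CustomObjectsApi. The group, version, kind, and spec fields below are placeholders, not the actual Trainer v2 schema.

```python
# Hedged sketch: programmatically creating a training CRD with the raw
# kubernetes client. Group/version/kind and the spec fields are placeholders,
# NOT the real Kubeflow Trainer v2 schema.
from kubernetes import client, config

def submit_train_job(name: str, image: str, num_nodes: int) -> dict:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    body = {
        "apiVersion": "trainer.example.org/v1alpha1",  # placeholder group/version
        "kind": "TrainJob",                            # placeholder kind
        "metadata": {"name": name},
        "spec": {"image": image, "numNodes": num_nodes},  # placeholder spec
    }
    return api.create_namespaced_custom_object(
        group="trainer.example.org",
        version="v1alpha1",
        namespace="default",
        plural="trainjobs",
        body=body,
    )
```

Everything beyond submission, such as streaming logs, hot-reloading code, and recovering from failures, is left for each project to rebuild, which is the gap described above.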

What’s inside? The OSS Framework vs. Platform

We use the term “Kubetorch core framework” to mean everything that actually touches your ML code (i.e., the Python client and Helm chart that let you run our public examples). This is opposed to the "platform," which includes all the persistent observability, compute sharing (prioritization/queuing/scheduling/quotas), and global configuration that help teams manage and optimize their clusters. We feel this is the cleanest open/closed split we can offer: all the performance, DevX, and fault-tolerance magic, which are the concerns of ICs, goes into open source, and all the governance, admin, and cost optimization, which are the concerns of management, stay in the closed, paid offering.

Unifying the GPU Rich and Poor

Finally, there’s also the practical matter that many ML teams and OSS maintainers do not have platform teams to manage Kubernetes for them, and choosing a serverless option should not exclude them from the ecosystem. Kubetorch Serverless provides a low-touch interface that absolves these teams of configuring and managing their own clusters, while still letting them move seamlessly to a first-party cluster as they scale, without code changes.

Our Kubetorch serverless offering is currently in alpha and will be formally announced soon. If you’re interested in trying it, please see here: https://www.run.house/kubetorch/get-started

If you’re an open-source maintainer or research institution, we have special pricing and credits available.

Looking ahead

While Kubetorch is still early, we’re excited to begin this journey of bringing the open-source ML ecosystem and Kubernetes ecosystem together. If you sit on either side of this fence and are interested in collaborating with us, please reach out at team@run.house, or join our Slack workspace.
