Kubetorch

Kubetorch is a modern interface into Kubernetes for running any ML workload (not just PyTorch). Its simple, Pythonic APIs give researchers and ML engineers access to compute, enable fast iteration, and ensure reproducibility.

  • Run any Python code on Kubernetes at any scale by specifying the required resources, distribution, and scaling directly in code (see the sketch after this list).
  • Iterate on that code in 1–2 seconds with magic caching and hot redeployment.
  • Execute code reproducibly by dispatching work identically to Kubernetes from any environment, including a teammate’s laptop, CI, an orchestrator, or a production application.
  • Handle hardware faults, preemptions, and OOMs programmatically from the driver program, creating robust fault tolerance.
  • Optimize workloads with observability, logging, auto-down, queueing, and more.
  • Orchestrate complex, heterogeneous workloads such as RL post-training, which requires coordinating different compute resources, images, scaling, and distributed communication within a single loop.
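
To make the first bullet above concrete, here is a minimal sketch of dispatching an ordinary Python function to Kubernetes with its resource requirements declared in code rather than in YAML. The names used (kubetorch imported as kt, kt.Compute, kt.Image, kt.fn, .to) mirror Kubetorch's Pythonic style, but treat them as illustrative assumptions rather than exact signatures.

    import kubetorch as kt  # illustrative API; exact names and parameters may differ


    def train(epochs: int = 1):
        # Ordinary Python; nothing Kubernetes-specific in the function body.
        print(f"training for {epochs} epoch(s)")


    if __name__ == "__main__":
        # Declare the compute the function needs directly in code
        # (GPU count, container image) instead of in YAML manifests.
        compute = kt.Compute(
            gpus=1,
            image=kt.Image(image_id="nvcr.io/nvidia/pytorch:24.08-py3"),
        )

        # Deploy the function to the cluster and call it like a local function.
        remote_train = kt.fn(train).to(compute)
        remote_train(epochs=3)

The same dispatch runs identically from a laptop, CI, or an orchestrator, which is what makes the reproducibility bullet above possible.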

Kubetorch consists of a Python library and a Kubernetes operator, both installed within your cloud account and VPC. It can be adopted incrementally within an existing ML stack or used as a full replacement, across training, batch processing, online inference, hyperparameter optimization, and pipelining.

Why Kubetorch?

Kubetorch was built to address the common anti-patterns of ML development:

  • Notebooks and dev pods enable fast iteration, but they limit distributed scaling, cannot serve inference, and create a major productionization bottleneck.
  • Hyperscaler platforms like SageMaker, Vertex AI, or Databricks impose rigid, opinionated workflows and are costly due to inefficient compute utilization and vendor markups.
  • Developing directly on Kubernetes or with Kubeflow is more flexible, but it requires debugging complex YAML and waiting through long iteration cycles (30 minutes to push, build, and relaunch a pod, including image pull and data reload).

Instead, with Kubetorch teams can:

  • Go from research to production 5x faster by eliminating the research-to-production gap. All code is production-ready; your team works out of a monorepo and uses a regular SDLC and GitOps to ship to production. There is no lengthy translation process and no risk that work fails to reproduce in production settings.
  • Save 50% on compute costs through significantly improved bin packing, intelligent resource optimization, and platform controls that enable scale-to-zero for inference and eviction of idle or underutilized workloads.
  • Reduce production faults by 95%, since you can catch and handle errors directly in your ML program and build in the right behavior, such as retrying with larger compute after an OOM or rerunning with a smaller world size after a preemption, as in the sketch below.
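
As a hedged sketch of that fault-handling pattern: because worker failures surface in the driver program as ordinary Python exceptions, the driver can catch an OOM and resubmit the work on larger compute. The exception type and the memory/gpus parameters below are assumptions for illustration, not the exact Kubetorch API.

    import kubetorch as kt  # illustrative API; exact names and parameters may differ


    def train(batch_size: int):
        ...  # training loop


    def launch(memory: str, batch_size: int):
        # Hypothetical resource spec; parameter names are assumptions.
        compute = kt.Compute(gpus=1, memory=memory)
        remote_train = kt.fn(train).to(compute)
        remote_train(batch_size=batch_size)


    try:
        launch(memory="32Gi", batch_size=64)
    except MemoryError:  # placeholder; the real OOM exception type depends on Kubetorch
        # Retry on larger compute with a smaller batch after the OOM.
        launch(memory="64Gi", batch_size=32)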

Learn More