Announcing Kubetorch: Blazing Fast ML Development on Kubernetes

Today, we're announcing Kubetorch, a powerful ML development interface for Kubernetes, purpose-built for modern enterprise ML and reinforcement learning. Kubetorch delivers:
- 100x faster iteration and debugging (from 10+ minutes to 1-3 seconds) for complex applications like reinforcement learning, distributed training, and composite inference
- Eager Python APIs to launch and scale compute resources in code, and compose them together as heterogeneous, reproducible, and cost-efficient applications
- Scaling and fault tolerance up to thousands of nodes, elegantly handling dynamic workload resizing and infrastructure failures
- General-purpose, unopinionated architecture that drops cleanly into your stack and takes full advantage of the rich Kubernetes ML ecosystem
Leading ML teams and frontier labs already run thousands of workloads per week with Kubetorch, across training, inference, data processing, and evaluation, in both research and production. While most teams struggle with Kubernetes's complexity, these organizations are shipping faster than ever, because their infrastructure finally works with them, not against them.
To support our growing platform and user base, we're also thrilled to announce that we've raised a $5M Seed round led by Work-Bench, with participation from Hetz and Fathom.
Kubernetes for ML: Inevitable but Incomplete
Over the last few years, nearly every major AI lab and ML-forward enterprise has shifted to Kubernetes as they hit scaling and flexibility walls in traditional research systems like Slurm or opinionated vendor solutions like SageMaker. Kubernetes has become the substrate for modern ML, with unmatched scaling, reproducibility, portability, and ecosystem integrations.
Despite its ubiquity, ML people often loathe Kubernetes. It has a reputation for slowing progress to a crawl and overcomplicating simple day-to-day activities, with a few particularly glaring gaps:
- No development path with fast iteration and debugging
- Rigid workload shapes that can't be composed together for heterogeneous applications
- Opaque, cascading failures with no recourse other than restarting
- No comprehensive solution for data transfer and sharing, especially for GPU data (we won't address this in this post, but will have more to share on it soon)
These gaps have come to a head in 2025, as reinforcement learning hits the ML mainstream. RL is the final boss of ML infrastructure, requiring systems to not only excel at every subtask of ML - training, inference, data processing, evals, etc. - but also compose them together in dizzyingly complex post-training sequences. This has broken every "safe" assumption built into ML tooling, driving a frantic search for flexible infrastructure that won't break down over the next 2-3 years of RL acceleration.
How Kubetorch Works
Kubetorch is a programmatic Python interface to Kubernetes that feels as natural as PyTorch, and is built on top of the mature scaling, integrations, and rich ML ecosystem of Kubernetes. Just as PyTorch lets you send a model .to("cuda"), Kubetorch lets you send arbitrary Python code (functions, classes, files) .to(compute) to run and scale on arbitrary Kubernetes compute resources. This makes using heterogeneous, dynamic, and large-scale compute feel like fast local development.
def hello(name):
    print(f"Hello {name}!")

def main():
    img = kt.Image("ghcr.io/rh/ml:v3")
    compute = kt.Compute(cpus="1", gpus="1", image=img)
    remote_hello = kt.fn(hello).to(compute)
    remote_hello("Kubetorch")  # calls the fn on k8s!
Fast deployment to Kubernetes, no YAML in sight ✨
While this simple structure might smell like other orchestration or distributed ML frameworks, Kubetorch uniquely solves three longstanding problems for ML on Kubernetes:
1. Iterations in Seconds, not 30 Minutes
The Problem: ML applications are an awkward fit for Kubernetes's development workflow, which was designed around CPU-centric web services. Its strong idempotency and reproducibility come at the cost of slow packaging and deployment, on the assumption that debugging has already happened elsewhere, like on a laptop.
ML's GPU-accelerated, distributed applications simply can't be run on a laptop. Every code change triggers a 20-40 minute iteration loop: tear down the existing Kubernetes resources, rebuild Docker images, push to registries, redeploy YAML, and redownload checkpoints and data. Hit a bug, add a print statement, repeat. Trying to avoid this long development loop with Dev Boxes and Notebooks introduces an equally long translation process to package and scale the workload later.
Kubetorch Magic: Kubetorch's accelerated packaging and deployment system restores 1-2 second iteration for even large ML applications, and makes ML development feel like regular software development again. Instead of sending workloads to compute with a job submission, we bring the compute to the work, holding resources in place while you develop. This isn't a stateful notebook environment; it's regular code and Kubernetes primitives with scale and reproducibility from the outset.
When you call .to(), Kubetorch determines exactly what must be updated in your existing compute resources, including dependencies, images, and scaling, then hot-syncs the changes instantly. It's as fast as iterating in a tunneled IDE, but for any distributed ML application. All iteration happens in regular code; no special IDE or DSL is needed.
We hit an error, iterate on our code, and then redeploy in <2 seconds to start training.
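As a rough sketch of what that loop looks like in code, reusing the kt.Compute / kt.fn(...).to(compute) pattern from the snippet above (the package import name, training function, and image tag are assumptions for illustration):

import kubetorch as kt  # assumed package import name

def train_step(lr):
    # ...your model code; edit this freely between runs...
    return {"loss": 0.123, "lr": lr}

def main():
    # Define the compute once; Kubetorch holds it in place while you iterate.
    img = kt.Image("ghcr.io/rh/ml:v3")  # placeholder image tag
    compute = kt.Compute(gpus="1", image=img)

    # First call packages and deploys the function to the cluster.
    remote_train_step = kt.fn(train_step).to(compute)
    print(remote_train_step(lr=3e-4))

    # Edit train_step locally (fix a bug, add a print), then call .to() again:
    # only the diff is hot-synced, so the redeploy takes ~1-2 seconds.
    remote_train_step = kt.fn(train_step).to(compute)
    print(remote_train_step(lr=1e-4))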
Unlike Ray and Spark, we never serialize code to dispatch it, so you won't hit dependency mismatch errors between the local and remote environments. You can truly drive and scale the workload from any laptop, and when you're done iterating, the application you push to version control runs as-is on your colleague's laptop, in CI, or in an ML pipeline orchestrator. Pushing to production simply means merging into main as-is, not translating into "production" DSL code and packaging structures.
Proven Results from Our Beta Users:
- 600x faster iteration loops: 10-30 minute rebuild+redeploy cycles → 1-3 second hot redeploys
- 2-3x faster research-to-production: 4-6 weeks of translation, scaling, and debugging → 1-2 weeks to build, debug, and merge as-is
"Kubetorch makes getting from research to production effortless. We can spin up training and inference experiments on real hardware as quickly as we finish a code review, without the headache of managing cloud infrastructure."
2. Dynamic, Heterogeneous Workloads in Eager Python
The Problem: Kubernetes workloads are static resource manifests and container images, where size, shape, and behavior must be rigidly configured upfront. This works for homogeneous tasks like pretraining but falls short for RL, composite inference, and automated training pipelines, which require dynamic resource allocation, inter-service communication, and different scaling patterns. Kubernetes requires you to stitch workloads together statically with no runtime interplay; you can't run an RL job by specifying a Kubeflow PyTorchJob, a vLLM service, and code sandboxes separately and simply saying "ok, launch."
Despite this, Kubernetes is still the least bad option here. Slurm was designed for fixed-size batch processing and has no concept of services or autoscaling, making RL's dynamic workflows practically impossible to fit into a Slurm job. Ray seems purpose-built for this heterogeneity, but it's actually designed for orchestrating tasks within a single workload, not composing different workloads. This means abandoning Kubernetes's rich ML ecosystem and scalability to rebuild everything in Ray's runtime, while losing control over the precise dependencies, scaling behavior, and compute resources of each sub-workload.
Kubetorch Magic: Kubetorch enables interactive, programmatic control over compute resources at any scale, composing Kubernetes workloads together flexibly in Python. This delivers the best of both worlds: the richness, scalability, and maturity of the Kubernetes ML ecosystem, with the fluency and control of a Pythonic distributed framework.
Heterogeneous workloads like RL become as simple as specifying subcomponents - training workers, an inference service, some evaluation containers - and driving their usage in Python. Instead of writing static YAML manifests and one homogeneous image for the entire workload, you dynamically create, scale, and delete arbitrary compute resources on the fly, and communicate between them in Python as your algorithm demands. There's no DSL, DAG, or opinionated runtime that takes months to adopt.
Our async GRPO implementation showcases this: autoscaling vLLM inference services work seamlessly with distributed PyTorch training workers. Each service has different dependencies, GPU requirements, and scaling policies, all coordinated through standard Python rather than complex orchestration systems.
# Autoscaling inference service
inference_compute = kt.Compute(gpus="1", image=inference_image) \
    .autoscale(min_scale=2)
inference_service = kt.cls(vLLM).to_async(inference_compute)

# Distributed training workers
train_compute = kt.Compute(gpus=8, image=training_image) \
    .distribute("pytorch", workers=4)
train_service = kt.cls(GRPOTrainer).to_async(train_compute)

# Deploy both in parallel
inference_service, train_service = await asyncio.gather(
    inference_service, train_service
)

# Use inference_service and train_service with standard
# async Python APIs
...
See our Async GRPO Example for how a few lines of Python with Kubetorch can replace opaque, deeply nested RL framework code.
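For a flavor of what driving these services with standard async Python looks like, here is a minimal sketch of one rollout-then-update step; the generate_rollouts and train_step methods are hypothetical stand-ins for whatever your deployed vLLM and GRPOTrainer classes expose, not Kubetorch APIs:

import asyncio

async def grpo_loop(inference_service, train_service, prompts, steps=100):
    for step in range(steps):
        # Hypothetical method on the deployed vLLM class: sample completions
        rollouts = await inference_service.generate_rollouts(prompts)

        # Hypothetical method on the deployed GRPOTrainer class: one policy update
        metrics = await train_service.train_step(rollouts)
        print(f"step {step}: {metrics}")

# asyncio.run(grpo_loop(inference_service, train_service, prompts=["..."]))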
Proven Results from Our Beta Users:
- 50%+ compute cost savings through intelligent resource allocation, bin-packing, and dynamic scaling
- Unlocking training and scaling methods that were previously unavailable (e.g. in SageMaker)
- Simplifying multi-stage workflows away from DAG orchestrators toward regular Python that is runnable locally, in CI, or on a cron schedule (a sketch follows below)
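To make that last point concrete, here is a minimal sketch of a multi-stage workflow as plain Python; the preprocess/train/evaluate functions, image tags, and resource sizes are placeholders, and the package import name is an assumption:

import kubetorch as kt  # assumed package import name

def preprocess(raw_path): ...
def train(dataset): ...
def evaluate(checkpoint): ...

def pipeline(raw_path):
    # Each stage gets its own image, resources, and scaling policy.
    cpu = kt.Compute(cpus="8", image=kt.Image("ghcr.io/acme/data:latest"))
    gpus = kt.Compute(gpus=8, image=kt.Image("ghcr.io/acme/train:latest")) \
        .distribute("pytorch", workers=2)

    dataset = kt.fn(preprocess).to(cpu)(raw_path)
    checkpoint = kt.fn(train).to(gpus)(dataset)
    return kt.fn(evaluate).to(cpu)(checkpoint)

# The same function runs from a laptop, in CI, or on a cron schedule:
# pipeline("s3://my-bucket/raw/")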
3. Indestructible Execution Model
The Problem: Today, any failure that can't be caught and handled inside your ML workload results in cascading destruction. OOM errors, GPU/NCCL errors, node preemptions, and simple unforeseen bugs cause total workload failure with only one recourse: restart from checkpoint.
As workloads scale, these failures consume 50% of an ML team's time as they find themselves constantly babysitting and manually unblocking jobs. This brittleness also makes cost optimizations like workload rescaling and spot compute painful if not impossible. Most GPU clusters sit at 50-75% utilization because without strong fault tolerance, it's better to over-provision resources than to optimally bin-pack and schedule.
Kubetorch Magic: Kubetorch introduces strong isolation and supervision between the driver program and the compute resources. This eliminates cascading failures and gives the user control to handle errors in code. When a fault happens, we freeze the resources and return control to your Python driver to decide how to proceed. The same error handling patterns from regular Python development now work for distributed ML workloads.
For instance, instead of losing hours of compute progress when one worker hits an OOM error, you can catch the exception, save your progress, and then make intelligent decisions about how to proceed. You might automatically reduce batch size and continue, migrate work to healthy nodes, or even acquire additional resources to handle the workload. Kubetorch also supports distributed workload rescaling, spot training, and on-the-fly node eviction with minimal loss of progress.
train_ddp = kt.fn(train).to(gpus)

batch_size = 8
while batch_size <= 256:
    try:
        train_ddp(batches=1, batch_size=batch_size * 2)
        batch_size *= 2
    except Exception as e:
        if "CUDA out of memory" in str(e):
            print(f"OOM at {batch_size*2}, proceeding with {batch_size}")
            break
        else:
            raise e

# Continue training with optimal batch size
results = train_ddp(epochs=10, batch_size=batch_size)
Try doing that in Airflow!
You can see a number of use cases that would be impossible with today's training primitives in our fault-tolerance examples.
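As one hedged illustration of the rescaling pattern described above, built only from the kt.Compute / .distribute / kt.fn(...).to constructs shown earlier (the train function, image tag, halving policy, and import name are placeholders, not a prescribed recipe):

import kubetorch as kt  # assumed package import name

def train(epochs, batch_size): ...  # placeholder training fn that checkpoints internally

workers = 8
results = None
while workers >= 2:
    compute = kt.Compute(gpus=1, image=kt.Image("ghcr.io/rh/ml:v3")) \
        .distribute("pytorch", workers=workers)
    train_ddp = kt.fn(train).to(compute)
    try:
        results = train_ddp(epochs=10, batch_size=64)
        break
    except Exception as e:
        # e.g. a spot preemption or node failure surfaced to the driver;
        # resume from the latest checkpoint on a smaller worker pool
        print(f"Failed on {workers} workers ({e}); retrying with {workers // 2}")
        workers //= 2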
Proven Results from Our Beta Users:
- 95%+ reduction in fault-resolution time
- Operational scaling: one team 5x-ed their daily workloads (to ~800/day) to support more customers and more frequent retraining for fresher models, after previously being blocked from scaling by operational overhead
Powerful Results, And More to Come
Kubetorch is already running in production at Fortune 500 companies, frontier AI labs, and AI-native startups. We've seen it restore development loops from minutes to seconds, improve compute utilization by orders of magnitude, unlock previously inaccessible ML applications, and eliminate multiple FTEs worth of debugging, babysitting, and workload management.
As we expand the core platform, we're also investing heavily in the observability, cost-optimization, and data movement tooling we offer alongside Kubetorch.
Ready to Ignite Your ML Infrastructure? 🔥
Kubetorch represents our philosophy that ML code should command the infrastructure, not the other way around. We've built it to be as familiar as PyTorch and as powerful as Kubernetes, without the complexity of traditional ML platforms. Kubetorch isn't a speculative new direction for ML infrastructure, but an inevitability.
"Programmatic control over ML compute is the new world. Now that I've seen it I can never go back to Slurm. It's hard to point to just one thing that's better, it's 100 things."
Getting started is fast: installation takes less than 30 minutes on your own cluster, and running existing workloads with Kubetorch can typically be done same-day. Reach out today to join the Beta for early access.
For more information:
- Explore our documentation for guides, examples, and API references
- Get in touch to discuss your specific use case
Stay up to speed 🏃‍♂️💨
Subscribe to our newsletter to receive updates about upcoming Runhouse features and announcements.