Kubernetes: The Winner for ML

Kubernetes has emerged as the definitive compute substrate for machine learning by offering unmatched flexibility, scalability, and reproducibility across diverse workloads and environments.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

Published June 11, 2025

There is no single definitive reference architecture for AI/ML. Teams' requirements vary widely by organization and span research and production, traditional ML and generative AI, and batch processing, training, and online inference. But if you cut through the noise, what's relatively certain is that Kubernetes is the clear winner for serving these extremely heterogeneous demands while offering flexibility and scale.

Kubernetes has already been battle-tested by broad adoption in software, but several demands of developing AI/ML systems (we will just shorten to "ML") are distinctly aligned with Kubernetes's feature set. To briefly taxonomize the benefits: it is unopinionated about workloads, supports scale, bin-packs resources efficiently, and enables wide reproducibility. For almost every team, Kubernetes is provably the best compute substrate for ML, and below, we will detail why.

Kubernetes is the unopinionated standard for scalability and reproducibility

Provably General Purpose

Kubernetes is entirely unopinionated about what you run within its pods, and as we noted above, there's significant heterogeneity in ML workloads. Teams demand horizontal scaling, vertical scaling, embarrassingly parallel batch processing offline, online serving, etc. Kubernetes, as a compute platform, supports all of this heterogeneity out of the box through simple pod replication and declarative resource specifications.
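To make "pod replication and declarative resource specifications" concrete, here is a minimal sketch using the official kubernetes Python client; the image name, namespace, and resource numbers are placeholders rather than recommendations, and the same request is just as commonly written as a YAML manifest.

# Minimal sketch using the official `kubernetes` Python client.
# The image, namespace, and resource numbers are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="batch-scorer",
    image="ghcr.io/example/batch-scorer:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        limits={"memory": "32Gi", "nvidia.com/gpu": "1"},
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="batch-scorer"),
    spec=client.V1DeploymentSpec(
        replicas=4,  # horizontal scale is a single declarative field
        selector=client.V1LabelSelector(match_labels={"app": "batch-scorer"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "batch-scorer"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="ml", body=deployment)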

Moreover, the ecosystem continues to develop at a rapid pace; choosing Kubernetes will never leave you in a position where some new open-source library or distributed framework cannot run on your infrastructure. You should be able to point at any GitHub repo you see on Hacker News and immediately run it. When your team uses a vendor platform with buttons to "do X," your flexibility to experiment is limited to what a Seattle-based product manager decided was important. For instance, we see teams shoehorning ML training into Spark jobs without using Spark at all, or stacking frameworks like Ray on top of Spark, simply because their choice to fully adopt Databricks for data warehousing has created strict constraints on what they can run.

Being generic also has the advantage of portability. Every Kubernetes cluster can run your containers identically. At the startup scale, portability confers the ability to chase down hundreds of thousands of dollars in cloud credits scattered across incumbents and neo-clouds. At the enterprise scale, hybrid cloud and multi-cloud are the de facto norm. The last few years have also highlighted the value of adopting neo-clouds to find the best GPU availability, lowest cost, or highest-end hardware. Your ML team’s code should not have to be aware of what vendor it runs on, and Kubernetes provides that abstraction. By contrast, it would be unimaginably hard to move from AWS Sagemaker APIs to Google Vertex APIs.

And when viewed from the perspective of the broader software platform, using Kubernetes avoids an awkward fork between the core and ML software platforms. If Kubernetes is generic enough to handle ML workloads, then it is preferable to run them on infrastructure that plugs into unified observability, auth, cost management, etc., all of which can likely be deployed automatically with any new cluster. Using Kubernetes also means general platform engineers can upskill quickly and support the ML platform. There's certainly differentiated ML domain knowledge, but with Kubernetes as the compute substrate, there's a wide pool of talent available to become ML platform owners. Today, MLOps often feels siloed from the platform team, and its upskilling is too tied to vendor knowledge.

Scalability

Scalability is a generic term, so we should be precise about the facets that matter most for ML. From the infrastructure perspective, Kubernetes offers horizontal scaling: you launch a container and then scale down to zero replicas or up to a theoretically unlimited number. For inference, you will want to horizontally autoscale based on demand. For training or batch processing, modern frameworks achieve vertical scale via horizontal scale; Spark, Ray, Dask, etc. all add worker replicas ("vertical-ish" scaling) to give a single task more resources. On the management side, scalability means resource and cost efficiency as the cluster scales. Compute forms a common pool, which means many applications can be efficiently bin-packed to fully saturate it.
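As a sketch of the inference-side autoscaling point (again via the kubernetes Python client; the Deployment name and thresholds are placeholders), a HorizontalPodAutoscaler declares the scaling policy. Note that vanilla HPA scales down to a minimum of one replica; scale-to-zero typically requires something like KEDA or Knative, and GPU services often scale on custom metrics rather than CPU.

# Hedged sketch: a CPU-based HorizontalPodAutoscaler for an inference Deployment.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-server",  # placeholder
        ),
        min_replicas=1,
        max_replicas=20,
        target_cpu_utilization_percentage=70,  # add replicas when average CPU exceeds 70%
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml", body=hpa
)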

ML applications demand scalability. Within an ML pipeline, initial data preprocessing (Spark, Dask, Ray Data, Daft, etc.) immediately relies on worker replicas that are easily spun up via Kubernetes. In training, even lightweight models like XGBoost benefit from parallelizing hyperparameter search instead of running it sequentially on a single node. And the ability to scale up a training run with distributed execution is a "go-faster" dial that many teams cannot access without Kubernetes. If a training run takes 1 day on a single node, Kubernetes makes it simple to replicate your training loop over 12 nodes, finish in roughly 2 hours (plus a few minutes of overhead), and see the results sooner. This happens without a cost differential, since the total consumed instance time is the same (the only limitation is quota). On the inference side, autoscaling is a P0 requirement, even more so than for other online services, since GPU capacity is limited and expensive.
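To illustrate the hyperparameter-search point, here is a hedged sketch using Ray Tune on a Ray cluster that might be running on Kubernetes (e.g., launched via KubeRay); the objective function and search space are stand-ins for a real XGBoost training routine.

# Hedged sketch: fan a hyperparameter search out over cluster workers with Ray Tune.
# The objective and search space are placeholders for a real training function.
import ray
from ray import tune

ray.init(address="auto")  # connect to an existing Ray cluster (e.g., launched via KubeRay)

def objective(params):
    # A real objective would train XGBoost with `params` and return a validation metric.
    score = params["eta"] * params["max_depth"]  # stand-in computation
    return {"score": score}

tuner = tune.Tuner(
    objective,
    param_space={
        "eta": tune.loguniform(1e-3, 3e-1),
        "max_depth": tune.randint(3, 10),
    },
    tune_config=tune.TuneConfig(num_samples=64, metric="score", mode="max"),
)
results = tuner.fit()
print(results.get_best_result().config)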

Similarly, the bin-packing point is particularly important for ML workloads due to the high cost of ML compute and the fact that VM shapes are rarely tailored so a single task can simultaneously saturate GPU memory, RAM, and CPU. Without Kubernetes, each task is mapped to a VM, and waste can become egregious. As an Azure example, renting a single mid-grade NVIDIA A10 GPU with 24GB of VRAM also forces you to rent at least 36 vCPUs and 440GB of memory; a typical workload uses a bit of that memory to stage new batches of data but leaves most of the 440GB sitting idle. Additionally, bin-packing multiple workloads onto a single GPU to fully saturate it is becoming table stakes. Modern schedulers, like NVIDIA's open-source Kubernetes AI (KAI) scheduler, enable efficient sharing and full utilization of the GPUs within your cluster.

Reproducibility

Containerized applications became the dominant pattern across software because they ensure consistent, reproducible execution. Kubernetes is declarative infrastructure for scalable container orchestration, and the only practical way to run and deploy containers at any reasonable scale. To briefly define "declarative": the developer specifies the infrastructure they need, including any required secrets, and Kubernetes creates and maintains that desired state for the application. A request like "I want 16 CPUs with this team image" ensures every run and every teammate starts from the same resources, with the OS, libraries, and drivers reproduced exactly. The request and the containers are all version-controlled through regular software lifecycle practices. Especially in ML, where setup, package versions, driver versions, and the overall environment heavily influence results and outputs, containerization is necessary for reproduction.
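A hedged sketch of what that "I want 16 CPUs with this team image" request might look like through the kubernetes Python client; the image name, namespace, and entrypoint are hypothetical.

# Hedged sketch: a single declarative pod request; image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="team-train-run"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="train",
                image="ghcr.io/example/team-ml-image:2025.06",  # hypothetical team image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(requests={"cpu": "16"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml", body=pod)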

But why do we need reproducibility? First, in ML, research-to-production is fundamentally a question of workload reproducibility. In many teams, the code written during research is not "real code" that can run as-is in production (whether it lives in a notebook, on a devbox, etc.). This is a significant anti-pattern not seen anywhere else in software development. A long, multi-week process to fully rewrite the research code for production becomes necessary, breaking the reproducibility of the code and often of the result. Introducing containerization and Kubernetes avoids that headache: no translation is necessary to reach production.

As a second, related point, reproducibility is the basis of a team's productivity, especially across onboarding, collaboration, and iteration. My ability to reproduce work is the starting point of a new project or a debugging fix; only then can I make iterative changes. Onboarding time balloons when updating a model is not simply a matter of checking out code and opening a PR. This also ties into proper lineage and governance. A production system has lineage when it has a pipeline that can be trivially rerun, and that pipeline can be governed through regular version control and audits. Conversely, the anti-pattern is producing a model checkpoint out of a notebook and relying on written documentation rather than containerization to enforce reproduction.

Containerization does come with a drawback: for many teams, the only way to execute any code change is to rebuild a container. For most of the software world, changes can be tested locally before being containerized and deployed to production, so the heavy container build in CI is a fixed, infrequent cost. Unfortunately, given the size of datasets and the need for GPU acceleration, ML often has no local development and testing step. Many ML engineers must test through deployment: rebuild the container to execute on the powerful compute in the Kubernetes cluster, re-push it to a container registry, relaunch pods, and re-pull data and checkpoints into the new pod. This can introduce wait times well in excess of 20 minutes simply to add a print statement. We explain how this can be avoided later, but this long iteration loop is why so many teams are tempted by notebooks or direct SSH/devboxes, even if they are worse for reproducibility.

Kubernetes outperforms other candidates for ML compute

We can directly contrast using Kubernetes against the three most common ways to execute ML workloads.

Notebooks and DevBoxes

The most convenient way to start a new ML project is on a single, static machine. This ranges from purely local development to launching a hosted notebook with any service, SSH'ing directly into a VM, or spinning up a devbox. The appeal is obvious: simplicity. ML requires iteration on powerful compute, and the next best thing to working locally on a $10,000 desktop tower is to work "locally" somewhere else with a GPU.

But this approach is not well-aligned with a traditional software development lifecycle. Notebooks and devboxes have similar, but slightly different, sins. Without reproducing the substantial existing literature critiquing notebooks, the issues are, briefly: an inability to be properly version-controlled (notebooks are JSON with outputs, not text files of code), difficulty modularizing code and reusing components, and no integration with testing frameworks. Devboxes and working directly on VMs suffer from a similar "statefulness," but there it comes from the interactive commands used to set up the machine, the environment setup, the file system, and potential OS/driver-level differences.

But importantly, both tools fail to scale past the bounds of a single node. One funny phenomenon is how many teams run distributed training as 8-GPU workloads. Why? That's the maximum number of GPUs a single VM typically holds, one size up from 4. There's a steep barrier to scaling up to 9 GPUs or down to exactly 7. But scalability is not just for massive distributed training runs; running hyperparameter sweeps on a notebook or devbox is impractical, as is testing production-like inference with autoscaling.

So when employing these tools, there is a 2-week to 6-month process to move research work to production, which we noted above under reproducibility. There is the tactical need to refactor the code into regular Python. The bigger impact, though, comes from reproducing the research findings themselves in production: scaling up to full compute and full data and running end to end. Researchers might be responsible for carrying their own work to production, but many teams hand off the responsibility, which adds a further burden of context transfer and coordination. In the end, this work typically lands on Kubernetes anyway.

Vendor Abstractions

Vendor-specific tooling comes in many shapes, but suffers from common issues – namely, opinionation and rigidity. As we noted earlier, if you stay within the paved path of what a Product Manager at a hyperscaler has decided is important, then the development experience tends to be fine. However, you are conforming your development to the abstractions the provider has decided on in exchange for avoiding infrastructure. This means that trying an arbitrary method is impossible. On Sagemaker, you can run a hyperparameter search with grid search or Bayesian optimization, which is great, but if you want to define custom search logic, you’re back to writing and deploying your own code.

As a meta point, many of these services are opinionated abstractions over Kubernetes or Kubernetes-like systems, where you pay a meaningful compute upcharge while sacrificing flexibility and bin-packing. Google Cloud Run is built on Knative, a Kubernetes-based serverless layer, but limits you to L4 GPUs while imposing a stiff upcharge on equivalent compute per minute. AWS Lambda covers roughly the same set of tasks, but does not allow GPUs and restricts run duration. Databricks marks up compute by 50+%. Cynically, sales engineering teams push end users toward these solutions because they represent significant vendor lock-in; less cynically, the simple omission of the word "Kubernetes" makes the system feel significantly less scary.

It would be too harsh to dismiss reliance on vendor tools outright, as teams point to governance and auth as killer features built into them. But these are the same cloud governance tools available to you more generally in managed Kubernetes. Especially if you already have established engineering platforms with regular dev and production environments, inheriting that governance into Kubernetes is essentially free. (If you do not, don't start your governance journey with Sagemaker auth controls.) Launching production workflows by adding a scheduler to Databricks notebooks simply because they come with governance ignores all of the reasoning behind software and platform engineering best practices.

Slurm

Some researchers love to work out of Slurm, a traditional HPC workload manager that can schedule and allocate resources at scale. This may partly be attachment to what folks already know; academia lives in Slurm, so, naturally, Slurm is popular for research work. A distinct advantage of Slurm is that it gives researchers iterative development while letting them write regular code and command scale. Compute persists across executions, so there is no wait to rerun changed code, and a shared filesystem mounted across all nodes makes for convenient iteration loops and shared data files.

There are a few drawbacks, though. The main complaint about Slurm is poor developer experience when faults happen, and this is twofold. First, Slurm is hard to debug. Logs can be hard to find and are scattered across nodes; there is no standard log streaming or error handling. For a range of common errors, like OOMs or CUDA failures, a fault can destroy a job and leave nothing to trace (or you dump the state at failure and hope it's accurate). Second, Slurm doesn't come with the fault tolerance that other systems have. At the infra level, Kubernetes ships with health checks and standard retries out of the gate. At the program level, because Slurm essentially runs your program as a bash script, the ability to catch and handle errors on the fly (as in regular programming) does not exist.

Another obvious flaw is that Slurm has no meaningful place for production workloads, whether you consider production to be recurring production training, offline/batch inference, or online inference. You will end up operating a Kubernetes cluster for some of your AI/ML tasks either way, and now you've cleaved your ML compute into multiple pools and have multiple systems to maintain. You'll likely also run into a research-to-production handoff problem where someone has to turn a bash script into a containerized application. Notably, Slurm makes sense at massive AI-lab scale, where research can be fully carved off and massive training runs are the goal (and even at these organizations, heterogeneous workloads such as reinforcement learning are increasingly done off of Slurm). But otherwise, launching a Slurm cluster is giving in to the temptation of offering one stakeholder group an easy way out at the cost of the lifecycle as a whole.

What’s the Catch or “I Heard Bad Things about Kubernetes”

There's always an abstract concern that "Kubernetes is hard to manage and only for scaled companies," or "we're just not there yet." We cannot handwave away the fear, but it is not that hard to "have" a cluster; you can launch a Kubernetes cluster with Terraform or GKE Autopilot in about 15 minutes, and the ongoing maintenance is a well-understood task. Built-in cloud tooling makes it easy, and DevOps and platform engineers have decades of collective experience managing logging, observability, and auth. It might even come free if your organization already operates Kubernetes clusters, and Kubernetes is certainly a more transferable skill to learn than a vendor console. Either way, layering in observability and cost controls should be table stakes, and it's far better to do it in the ecosystem-rich world of Kubernetes than in the niche world of a vendor solution.

But once on Kubernetes, there are other complaints, and the most common is that the developer experience can be horrific. Specifically, as we called out above, most teams are forced into "development through deployment," making rapid iteration nearly impossible. Unlike traditional software, where changes can be tested locally before being pushed through CI, many ML teams must test changes via deployment. A code change as small as inserting a print statement still requires rebuilding a container, waiting for Kubernetes to pick it up, loading any required data/models, and only then running. The typical iteration loop for testing changes is 30-60 minutes, even for at-scale ML teams at companies you would expect to operate more efficiently. This is obviously bad and saps the productivity of your valuable research and engineering teams.

Adopting Kubernetes also poses a skill wall, because it suddenly requires everyone to “learn Kubernetes.” Rather than debugging console errors interactively in Python, the team is now asked to learn how to debug YAML, interact with the cluster to find logs, and take on upskilling every new member of the team. Moving away from well-understood Python and local-like execution can impose an immediate and recurring productivity hit.

We Built Kubetorch for ML on Kubernetes

Hopefully, by this point, we’ve convinced you that Kubernetes is a good place to “do AI/ML.” However, if either the stated pains resonate with your existing Kubernetes-based workflow or you’re still wary of the cost/benefit of adopting Kubernetes, we’d like to propose a solution: Kubetorch. At a high level, we’ve built Kubetorch to be the best packaging and deployment companion to Kubernetes for ML workloads. It allows you to instantly dispatch workloads to Kubernetes and call the workloads running “remotely” as if they were local.

First, we dramatically improve the developer experience by re-enabling interactive development in regular code. Instead of long rebuild-and-redeploy loops, code changes can be tested in under 2 seconds; changes are hot-synced, and all your data and artifacts are already in place to rerun. This enables a great developer experience with instant iteration (and streaming logs) without sacrificing code quality or the scale of compute commanded.

Second, we offer a Pythonic interface to Kubernetes. YAML is tedious to write and maintain, while Kubernetes can make it hard to debug failures. With Kubetorch, developers write Python to command powerful cloud compute, manage it (e.g., autostops), launch replicas, and set up distributed execution environments.

Finally, you use Kubetorch identically from any Python environment to launch execution, so research-to-production is as simple as checking in your code and running the same Kubetorch .to() from your production environment, whether that's an Airflow node or CI.

import kubetorch as kt

# Request GPU compute with an NGC PyTorch image, distributed over 4 PyTorch workers.
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
).distribute("pytorch", workers=4)

train_ddp = kt.fn(train).to(gpus)
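For completeness, a hedged sketch of how a dispatched function might then be invoked; train and its arguments are placeholders, and the call pattern follows the "call remote workloads as if they were local" description above.

def train(epochs: int = 1):
    # Placeholder for a regular PyTorch training loop.
    ...

train_ddp = kt.fn(train).to(gpus)  # as above
train_ddp(epochs=3)  # called like a local function, executed on the remote GPU workers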

Kubetorch preserves the flexibility and unopinionatedness of Kubernetes – bring any Python program, your containers, choice of orchestrator, and any cloud. Developers run a pip install, while the platform team can just helm install Kubetorch onto any cluster, and we’ll integrate right into the existing infrastructure stack.
