
SLURM, SUNK, SLINKY, SLONK: Chasing Speed, Stability, and Scale for Bleeding-Edge ML

Increasingly, we are being asked by ML infrastructure teams what to do about Slurm, and more specifically, the emerging “Slurm on Kubernetes” trend. We work with dozens of such teams, and decided to weigh in here with a broad survey of the patterns and anti-patterns we see around Slurm (and Slurm-like) development in modern ML environments.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

July 10, 2025

While Slurm is traditionally the scheduler of choice for HPC and academia, for many years it was also wholeheartedly adopted for training workloads at private AI labs. Its ability to instantly scale workloads across many machines, fast cold start times (and therefore iteration, if compute is widely available), and rich scheduling controls made it an excellent fit for ML training research. Its human-process-centric workflows, rigid and homogeneous parallel compute model, and limited production support (statefulness/reproducibility, fault-tolerance, observability, online/serving workloads) were not relevant for pure research training activities.

The speed and scalability of Slurm are particularly top of mind today, as nearly every other AI/ML execution system forces you to trade off between iterability and scale. Single-node tools like dev boxes or notebooks have fast, interactive iteration, but utterly lack compute and data scale (not to mention reproducibility or production support). By contrast, Kubernetes allows such powerful scaling and production support that it’s now the definitive compute foundation for ML engineering teams (and the standard substrate for delivering cloud compute, as we discussed in another blog post). However, Kubernetes offers horrific iteration speeds and development experience for researchers. The 30+ minute “execution through deployment” iteration loops (rebuilding containers, redeploying, relaunching pods, etc.), the instantly cascading failures which leave no environment behind to debug, and the sea of YAML, kubectl and Docker incantations for everyday development, all make Kubernetes and the standard Kubernetes tooling (Argo, Kubeflow, etc.) a poor fit for research.

So for many years, Slurm bridged this gap as the only known system for fast iteration on scaled compute, and teams took on the burden of painfully migrating Slurm research code to Kubernetes production as needed. And recently, pressure from infrastructure teams and cloud providers to unify around Kubernetes as the underlying observability, management, and “delivery” layer for ML compute has led to a proliferation of Slurm-like interfaces on top of Kubernetes (SUNK, Slinky, Soperator) for the researchers.

However, we argue that shimming Slurm and Kubernetes together is an imperfect and temporary stopgap, one that is already showing significant stress fractures at many AI labs. As training workloads grow increasingly complex, such as RL post-training, online retraining, and model-aided data processing and evaluations, Slurm’s execution model is proving unable to accommodate the heterogeneity, debuggability, reproducibility, fault tolerance, automatability, and productionization needs of many modern AI/ML teams. We’ll close by sharing what we feel is the ideal “best of both worlds” experience for fast iteration at scale in ML, thoughtfully incorporating key properties of both systems rather than bluntly slamming them together - a system called Kubetorch, which we are privately incubating now.

Why Slurm Works

Commanding Scale

For most teams, Slurm is the simplest way to launch a multi-node distributed training job. Other research systems like dev boxes, SSH into VMs, or notebooks are largely limited to a single node.

  • Familiar, Linux-centric interface: Compared to the complex interfaces and execution flows of other distributed compute systems, Slurm’s interfaces are two of the most familiar to any researcher - the file system and the command line. It lets you tweak and manipulate these as usual on a single node, and then predictably replicates them on many nodes to scale.
  • Elastic growth without rewrites: The same srun line (theoretically) works on 4 GPUs or 1,000 GPUs, so your training loops can access a “go faster” button.
  • Simple distributed communication: Slurm’s MPI-style launching (SPMD) is a minimal, widely supported execution model, so it works with PyTorch Distributed, TensorFlow Distributed, Horovod, DeepSpeed, etc. out of the box (see the sketch below).
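As a concrete sketch (not from any particular cluster - the job name, resource lines, rendezvous port, and train.py are placeholders), a hypothetical sbatch script for a multi-node PyTorch Distributed run might look like this:

```bash
#!/bin/bash
# Hypothetical job script: srun starts one torchrun per node, and torchrun
# in turn spawns one worker process per GPU. All names/paths are placeholders.
#SBATCH --job-name=pretrain
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00

# Rendezvous on the first node of the allocation.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py --config /shared/configs/pretrain.yaml
```

The “go faster” button from the earlier bullet is, in effect, the --nodes line: the same srun invocation fans out across whatever allocation the scheduler grants.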

Restoring the Iteration Loop

Slurm deploys a workload to a node simply by running a CLI command on it, so there’s no packaging or allocation latency, producing excellent cold starts and iteration speeds at scale.

  • Shared filesystem: Data and code both live on shared network storage, and there is no setup latency to re-execute work.
  • Compute (can) stay available: Users can (but often don’t) configure certain workloads to allow interactive re-execution or even “hero” priority for a block of compute, preventing the user from having to wait in the queue for each rerun (see the sketch after this list).
  • No container rebuilds: Requiring a rebuild of a container to rerun a workload is a fundamental flaw of many Kubernetes (or vendor) platforms, since that process can take 20+ minutes to move through CI, rebuild, and deploy.
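A hedged sketch of that loop, assuming an interactive partition and a shared home directory (both placeholders here): allocate once, then edit and rerun against the same nodes with nothing rebuilt or redeployed in between.

```bash
# Hold a block of GPUs interactively (partition name is a placeholder).
salloc --nodes=4 --gres=gpu:8 --time=08:00:00 --partition=interactive

# Inside the allocation: edit code in place on the shared filesystem...
vim /shared/home/me/train.py

# ...then rerun in seconds; no image build, push, or pod scheduling involved.
srun --ntasks-per-node=8 python /shared/home/me/train.py
```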

Scheduler

The goal for a compute platform is to maximize priority-weighted throughput of workloads, and this is principally what Slurm was designed to do.

  • Precise resource requests: Users specify exact resources (CPUs, GPUs, memory, time), enabling the scheduler to tightly pack jobs and minimize idle capacity.
  • Rich policy support: Supports priorities, fair-share, preemption, reservations, and advanced QoS controls to optimize cluster usage for extremely valuable GPU resources (see the sketch below).
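Most of that information lives directly in the header of the job script. The sketch below is illustrative only - the account, partition, and QoS names are placeholders that would be specific to a given cluster - but it shows the precision the scheduler gets to work with.

```bash
#!/bin/bash
# Hypothetical job header: exact resources plus the accounting/QoS hooks
# the scheduler uses for packing, fair-share, and preemption decisions.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --mem=480G
#SBATCH --gres=gpu:8
#SBATCH --time=72:00:00        # hard wall-clock limit aids tight packing
#SBATCH --partition=a100       # placeholder partition name
#SBATCH --account=research-llm # charged against this fair-share account
#SBATCH --qos=high             # QoS tier governing priority and preemption
#SBATCH --requeue              # allow the job to be requeued if preempted
```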

Where Does Slurm Struggle?

Debugging

Iterating on code can be fast, but finding the error is not. ML workloads are long-running, and distributed execution makes everything much more complex. A Slurm job is ultimately just a bash command, and what happens after launch is invisible to and ungoverned by the scheduler.

  • Cascading failures: Failures can easily destroy the entire environment and leave nothing to debug. Many teams must jury-rig tooling to dump state to the shared filesystem on CUDA/NCCL crashes (see the sketch after this list).
  • No fault transparency or handling: Relatedly, errors don’t propagate anywhere actionable and there’s no recourse when they happen; they must be handled entirely from within the application itself, with no “catching” possible from the outside. The only recourse is almost always to simply rerun from a checkpoint. Some teams have adopted Ray on Slurm (on Kubernetes!) to propagate some errors more transparently within the workload.
  • Compute isn’t always warm: When you cannot cordon off a “hero” group of nodes for yourself, you once again face slow iteration loops to debug your program as you wait in the queue over and over.

Workload Heterogeneity

Slurm is best at doing one "big" thing at a time, with a fixed set of compute for the duration of the job and more or less homogeneous parallelism that fully saturates each node. Sequential chaining of activities is done by humans in between. Historically, large-scale training fit nicely in this envelope, with the bulk of activity happening on a large block of GPUs and only a small amount of other activity interfering at the start or end. Automated production ML pipelines did not fit this model well, with data processing, training, and evaluation steps of various sizes, and relied on systems like Airflow for heterogeneous execution. However, training workloads are growing increasingly complex and looking closer to "production" in their elaborate sequential activities and real-time use of inference and CPU-centric environments. Slurm’s fundamental incompatibility with heterogeneity means these workloads simply do not "fit" inside a Slurm job, leading some AI labs to outgrow it.

  • Fixed-size resources: A Slurm workload requests a single size of compute for the entire workload, without an easy way to decompose each task into the right resource size. This can often mean leaving hundreds or thousands of CPUs idle while the GPUs work. Even with Slurm-on-Kubernetes frameworks where multiple pods could theoretically be run on a node, they must be pre-sliced into fixed sizes for Slurm, compared to dynamic sizing and bin-packing on Kubernetes.
  • Reinforcement learning post-training: One simple breakage we've observed, beyond workload complexity alone, is the need to spin up CPU evaluation environments on the fly during post-training. Slurm has no support for launching CPU containers on the fly.
  • Composite systems: Sophisticated or recurring training sequences compose training, inference, evaluation, and data processing together, all with different compute needs. Executing these on Slurm means humans running them one by one, which is obviously unsatisfactory for frequent or automated retraining. Trust us (and the entire data engineering world) that attempting to glue things together in bash scripts is asking for one nine of fault-tolerance (see the sketch below).
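For illustration, the bash-glue version of such a sequence usually leans on sbatch job dependencies (the script names below are placeholders). It runs, but each step is still a fixed-size allocation, failures surface only as missing output, and a human ends up untangling whatever half-finished state is left on the filesystem.

```bash
# Hypothetical chained pipeline: each stage is its own fixed-size job, and
# "orchestration" is just dependency flags between them.
prep=$(sbatch --parsable preprocess.sbatch)                              # CPU-heavy
train=$(sbatch --parsable --dependency=afterok:"$prep" train.sbatch)     # GPU-heavy
evaluate=$(sbatch --parsable --dependency=afterok:"$train" eval.sbatch)  # mixed
```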

Poor Reproducibility

Reproducibility matters in two directions: across individuals on a team and across environments. On a team, a new intern should be able to check out anyone’s code, make a small change, and test it. That same code should then run identically whether invoked manually, in CI, or on a recurring schedule.

  • Tribal knowledge: Program execution is specified through config flags and filesystem state (e.g., activating someone’s conda environment and hoping it hasn’t changed), so there is a config explosion and setup sequence that turns invoking a program into a magic incantation (or human process).
  • The filesystem is state: Both code and data are state that must remain stable during training, and then be preserved through additional manual processes to ensure reproducibility later. Momentarily checking out a different branch on your jump box could nuke an actively running training!

No Production

Slurm is built entirely for research, without any real pathway to production, whether recurring training or inference. Only a few teams at a few organizations truly operate purely in that space (AI labs doing massive pre-training, PhD researchers).

  • Translation for production: Anything that must be rerun regularly has to be manually translated (over weeks) into a pipeline or containerized application.
  • No inference: Whether for online or offline tasks (or RL mentioned above), there’s no concept of an inference service that can be launched and autoscaled.
  • No ecosystem for automation or observability: While observability and automation are distinct concerns, the common failure point is that there is no such thing as production Slurm, and therefore no tooling to support it (compare Kubernetes, which has a rich ecosystem for both).
  • A cleaved compute pool: Choosing Slurm means taking the available resources and pre-determining what proportion goes to research versus production, with no easy way to reallocate as needs change.

What is the idealized, “best of both worlds,” solution?

With consideration of the advantages and shortcomings above, as well as input from the hundreds of research and production ML teams we work with today, we believe a direct Kubernetes-native solution is possible for ML development with iteration at scale. Such a solution must have:

  • Scaling: Users must be able to take workloads in a familiar, existing envelope and direct the platform to execute them across many nodes, at the full compute and data scale of Kubernetes. This includes distributed execution (e.g. PyTorch, Ray), as well as autoscaling for inference.
  • Iteration Speed: The platform should not needlessly rebuild and redeploy with no awareness of existing state - making a code-only change and rerunning should take seconds at most to hot-sync into the existing environment, even for distributed or autoscaling programs.
  • Dynamic compute: Workloads should have on-the-fly access to arbitrary compute - any shape, image, or configuration that Kubernetes itself supports - allowing them to efficiently utilize compute resources only as needed, and adjust based on runtime conditions (e.g. I just caught an OOM, I need a larger machine).
  • Reproducible/Automatable: Start-to-end application and infrastructure logic should all be captured, versionable, and runnable as-is in code. Code you run locally should run identically in CI, in production, and on a new intern’s laptop - a completely closed loop to push new research into production or pull production code down for local debugging or experimentation. This also means complex workloads run end-to-end without human process or magic CLI incantations, so they can be automated on day 0.
  • Rich, Unified Scheduling: Scheduling must be rich and stable, but also must be applied evenly across all workloads. There should not be multiple scheduling pools/regimes for training, inference, batch, etc., which require extra headroom each to not bump into each other.
  • Debugging / Fault-tolerance: Every debugging channel that compliance allows should be available across any workload: propagating logs and exceptions, SSH into any node, PDB debugging inside live processes, profiling, utilization monitoring, etc. Errors should dependably propagate for action or recourse in code, so fault-tolerance can be applied surgically as a user concern, rather than coarsely through platform-level actions like retries.
  • Production-Readiness: In addition to the reproducibility, scaling, and fault-tolerance points above, the system must have Kubernetes’s rich tooling ecosystems and configurability for infrastructure concerns like observability, networking, security, optimization, etc.

Our approach - Kubetorch

Without turning this post into marketing, we’ve built Kubetorch, a Kubernetes-native platform which is carefully and ambitiously designed to achieve all of the above. It allows your team to execute Python workloads on cloud compute of any shape or scale, with extremely fast iteration - code changes re-execute in about 1-2 seconds. Please reach out to hello@run.house if you’re interested in working with us.

Gaps that Remain

There remains a major gap in the Kubernetes ecosystem around scheduling. Between Kueue, Volcano, YuniKorn, and others, there does not appear to be a scheduling and prioritization standard that is as rich, stable, and customizable as Slurm. In discussions with our partners, we’ve heard strong negative sentiment about how well these tools support scheduling “out of the box” without manual setup work. GPU scheduling is a further complication, where efforts to open source AI-focused schedulers for Kubernetes are still early.

We hope to see and contribute to more progress in this area. If you have opinions or would like to give us feedback here, let us know!
