
SLURM, SUNK, SLINKY, SLONK: Chasing Speed, Stability, and Scale for Bleeding-Edge ML

Increasingly, we are being asked by ML infrastructure teams what to do about Slurm, and more specifically, the emerging “Slurm on Kubernetes” trend. We work with dozens of such teams, and decided to weigh in here with a broad survey of the patterns and anti-patterns we see around Slurm (and Slurm-like) development in modern ML environments.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

July 10, 2025

While Slurm is traditionally the scheduler of choice for HPC and academia, for many years it was also wholeheartedly adopted for training workloads at private AI labs. Its ability to instantly scale workloads across many machines, fast cold start times (and therefore iteration, if compute is widely available), and rich scheduling controls made it an excellent fit for ML training research. Its human-process-centric workflows, rigid and homogeneous parallel compute model, and limited production support (statefulness/reproducibility, fault-tolerance, observability, online/serving workloads) were not relevant for pure research training activities.

The speed and scalability of Slurm are particularly top of mind today, as nearly every other AI/ML execution system forces you to trade off between iterability and scale. Single-node tools like dev boxes or notebooks have fast, interactive iteration, but utterly lack compute and data scale (not to mention reproducibility or production support). By contrast, Kubernetes allows such powerful scaling and production support that it’s now the definitive compute foundation for ML engineering teams (and the standard substrate for delivering cloud compute, as we discussed in another blog post). However, Kubernetes offers horrific iteration speeds and development experience for researchers. The 30+ minute “execution through deployment” iteration loops (rebuilding containers, redeploying, relaunching pods, etc.), the instantly cascading failures which leave no environment behind to debug, and the sea of YAML, kubectl and Docker incantations for everyday development, all make Kubernetes and the standard Kubernetes tooling (Argo, Kubeflow, etc.) a poor fit for research.

So for many years, Slurm bridged this gap as the only known system for fast iteration on scaled compute, and teams took on the burden of painfully migrating Slurm research code to Kubernetes production as needed. And recently, pressure from infrastructure teams and cloud providers to unify around Kubernetes as the underlying observability, management, and “delivery” layer for ML compute has led to a proliferation of Slurm-like interfaces on top of Kubernetes (SUNK, Slinky, Soperator) for the researchers.

However, we argue that shimming Slurm and Kubernetes together is an imperfect and temporary stopgap, one that is already showing significant stress fractures at many AI labs. As training workloads grow increasingly complex, such as RL post-training, online retraining, and model-aided data processing and evaluations, Slurm’s execution model is proving unable to accommodate the heterogeneity, debuggability, reproducibility, fault tolerance, automatability, and productionization needs of many modern AI/ML teams. We’ll close by sharing what we feel is the ideal “best of both worlds” experience for fast iteration at scale in ML, thoughtfully incorporating key properties of both systems rather than bluntly slamming them together - a system called Kubetorch, which we are privately incubating now.

Why Slurm Works

Commanding Scale

For most teams, Slurm is the simplest way to launch a multi-node distributed training job. Other research systems like dev boxes, SSH into VMs, or notebooks are largely limited to a single node.

  • Familiar, Linux-centric interface: Compared to the complex interfaces and execution flows of other distributed compute systems, Slurm’s interfaces are two of the most familiar to any researcher - the file system and the command line. It lets you tweak and manipulate these as usual on a single node, and then predictably replicates them on many nodes to scale.
  • Elastic growth without rewrites: The same srun line (theoretically) works on 4 GPUs or 1,000 GPUs, so your training loops can access a “go faster” button.
  • Simple distributed communication: Slurm’s MPI-style launching (SPMD) is a minimal, widely supported execution model, so it works with PyTorch Distributed, TensorFlow Distributed, Horovod, DeepSpeed, etc. out of the box (see the sketch below).
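As a concrete sketch (not from any particular cluster - the job name, resource lines, rendezvous port, and train.py are placeholders), a hypothetical sbatch script for a multi-node PyTorch Distributed run might look like this:

```bash
#!/bin/bash
# Hypothetical job script: srun starts one torchrun per node, and torchrun
# in turn spawns one worker process per GPU. All names/paths are placeholders.
#SBATCH --job-name=pretrain
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00

# Rendezvous on the first node of the allocation.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py --config /shared/configs/pretrain.yaml
```

The “go faster” button from the earlier bullet is, in effect, the --nodes line: the same srun invocation fans out across whatever allocation the scheduler grants.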

Restoring the Iteration Loop

Slurm deploys a workload to a node simply by running a CLI command on it, so there’s no packaging or allocation latency, producing excellent cold starts and iteration speeds at scale.

  • Shared filesystem: Data and code both live on shared network storage, and there is no setup latency to re-execute work.
  • Compute (can) stay available: Users can (but often don’t) configure certain workloads to allow interactive re-execution or even “hero” priority for a block of compute, preventing the user from having to wait in the queue for each rerun (see the sketch after this list).
  • No container rebuilds: Requiring a rebuild of a container to rerun a workload is a fundamental flaw of many Kubernetes (or vendor) platforms, since that process can take 20+ minutes to move through CI, rebuild, and deploy.
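A hedged sketch of that loop, assuming an interactive partition and a shared home directory (both placeholders here): allocate once, then edit and rerun against the same nodes with nothing rebuilt or redeployed in between.

```bash
# Hold a block of GPUs interactively (partition name is a placeholder).
salloc --nodes=4 --gres=gpu:8 --time=08:00:00 --partition=interactive

# Inside the allocation: edit code in place on the shared filesystem...
vim /shared/home/me/train.py

# ...then rerun in seconds; no image build, push, or pod scheduling involved.
srun --ntasks-per-node=8 python /shared/home/me/train.py
```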

Scheduler

The goal for a compute platform is to maximize priority-weighted throughput of workloads, and this is principally what Slurm was designed to do.

  • Precise resource requests: Users specify exact resources (CPUs, GPUs, memory, time), enabling the scheduler to tightly pack jobs and minimize idle capacity.
  • Rich policy support: Supports priorities, fair-share, preemption, reservations, and advanced QoS controls to optimize cluster usage for extremely valuable GPU resources (see the sketch below).
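Most of that information lives directly in the header of the job script. The sketch below is illustrative only - the account, partition, and QoS names are placeholders that would be specific to a given cluster - but it shows the precision the scheduler gets to work with.

```bash
#!/bin/bash
# Hypothetical job header: exact resources plus the accounting/QoS hooks
# the scheduler uses for packing, fair-share, and preemption decisions.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --mem=480G
#SBATCH --gres=gpu:8
#SBATCH --time=72:00:00        # hard wall-clock limit aids tight packing
#SBATCH --partition=a100       # placeholder partition name
#SBATCH --account=research-llm # charged against this fair-share account
#SBATCH --qos=high             # QoS tier governing priority and preemption
#SBATCH --requeue              # allow the job to be requeued if preempted
```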

Where Does Slurm Struggle?

Debugging

Iterating on code can be fast, but finding the error is not. ML workloads are long-running, and distributed execution makes everything much more complex. A Slurm job is ultimately just a bash command, and what happens after launch is invisible to and ungoverned by the scheduler.

  • Cascading failures: Failures can easily destroy the entire environment and leave nothing to debug. Many teams must jury-rig tooling to dump state to the shared filesystem on CUDA/NCCL crashes (see the sketch after this list).
  • No fault transparency or handling: Relatedly, errors don’t propagate anywhere actionable and there’s no recourse when they happen; they must be handled entirely from within the application itself, with no “catching” possible from the outside. The only recourse is almost always to simply rerun from a checkpoint. Some teams have adopted Ray on Slurm (on Kubernetes!) to propagate some errors more transparently within the workload.
  • Compute isn’t always warm: When you cannot cordon off a “hero” group of nodes for yourself, you once again face slow iteration loops to debug your program as you wait in the queue over and over.

Workload Heterogeneity

Slurm is best at doing one "big" thing at a time, with a fixed set of compute for the duration of the job and more or less homogeneous parallelism that fully saturates each node. Sequential chaining of activities is done by humans in between. Historically, large-scale training fit nicely in this envelope, with the bulk of activity happening on a large block of GPUs and only a small amount of other activity interfering at the start or end. Automated production ML pipelines did not fit this model well, with data processing, training, and evaluation steps of various sizes, and relied on systems like Airflow for heterogeneous execution. However, training workloads are growing increasingly complex and looking closer to "production" in their elaborate sequential activities and real-time use of inference and CPU-centric environments. Slurm’s fundamental incompatibility with heterogeneity means these workloads simply do not "fit" inside a Slurm job, leading some AI labs to outgrow it.

  • Fixed-size resources: A Slurm workload requests a single size of compute for the entire workload, without an easy way to decompose each task into the right resource size. This can often mean leaving hundreds or thousands of CPUs idle while the GPUs work. Even with Slurm-on-Kubernetes frameworks where multiple pods could theoretically be run on a node, they must be pre-sliced into fixed sizes for Slurm, compared to dynamic sizing and bin-packing on Kubernetes.
  • Reinforcement learning post-training: One simple breakage we've observed, beyond workload complexity alone, is the need to spin up CPU evaluation environments on the fly during post-training. Slurm has no support for launching CPU containers on the fly.
  • Composite systems: Sophisticated or recurring training sequences compose training, inference, evaluation, and data processing together, all with different compute needs. Executing these on Slurm means humans running them one by one, which is obviously unsatisfactory for frequent or automated retraining. Trust us (and the entire data engineering world) that attempting to glue things together in bash scripts is asking for one nine of fault-tolerance (see the sketch below).
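For illustration, the bash-glue version of such a sequence usually leans on sbatch job dependencies (the script names below are placeholders). It runs, but each step is still a fixed-size allocation, failures surface only as missing output, and a human ends up untangling whatever half-finished state is left on the filesystem.

```bash
# Hypothetical chained pipeline: each stage is its own fixed-size job, and
# "orchestration" is just dependency flags between them.
prep=$(sbatch --parsable preprocess.sbatch)                              # CPU-heavy
train=$(sbatch --parsable --dependency=afterok:"$prep" train.sbatch)     # GPU-heavy
evaluate=$(sbatch --parsable --dependency=afterok:"$train" eval.sbatch)  # mixed
```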

Poor Reproducibility

Reproducibility matters in two directions: across individuals on a team and across environments. On a team, a new intern should be able to check out anyone’s code, make a small change, and test it. That same code should then run identically whether invoked manually, in CI, or on a recurring schedule.

  • Tribal knowledge: Program execution is specified through config flags and filesystem state (e.g., activating someone’s conda environment and hoping it hasn’t changed), so there is a config explosion and setup sequence that turns invoking a program into a magic incantation (or human process).
  • The filesystem is state: Both code and data are state that must remain stable during training, and then be preserved through additional manual processes to ensure reproducibility later. Momentarily checking out a different branch on your jump box could nuke an actively running training!

No Production

Slurm is built entirely for research, without any real pathway to production, whether recurring training or inference. Only a few teams at a few organizations truly operate purely in that space (AI labs doing massive pre-training, PhD researchers).

  • Translation for production: Anything that must be rerun regularly has to be manually translated (over weeks) into a pipeline or containerized application.
  • No inference: Whether for online or offline tasks (or RL mentioned above), there’s no concept of an inference service that can be launched and autoscaled.
  • No ecosystem for automation or observability: While observability and automation are distinct concerns, the common failure point is that there is no such thing as production Slurm, and therefore no tooling to support it (compare Kubernetes, which has a rich ecosystem for both).
  • A cleaved compute pool: Choosing Slurm means taking the available resources and pre-determining what proportion goes to research versus production, with no easy way to reallocate as needs change.

What is the idealized, “best of both worlds,” solution?

With consideration of the advantages and shortcomings above, as well as input from the hundreds of research and production ML teams we work with today, we believe a direct Kubernetes-native solution is possible for ML development with iteration at scale. Such a solution must have:

  • Scaling: Users must be able to take workloads in a familiar, existing envelope and direct the platform to execute them across many nodes, at the full compute and data scale of Kubernetes. This includes distributed execution (e.g. PyTorch, Ray), as well as autoscaling for inference.
  • Iteration Speed: The platform should not needlessly rebuild and redeploy with no awareness of existing state - making a code-only change and rerunning should take seconds at most to hot-sync into the existing environment, even for distributed or autoscaling programs.
  • Dynamic compute: Workloads should have on-the-fly access to arbitrary compute - any shape, image, or configuration that Kubernetes itself supports - allowing them to efficiently utilize compute resources only as needed, and adjust based on runtime conditions (e.g. I just caught an OOM, I need a larger machine).
  • Reproducible/Automatable: Start-to-end application and infrastructure logic should all be captured, versionable, and runnable as-is in code. Code you run locally should run identically in CI, in production, and on a new intern’s laptop - a completely closed loop to push new research into production or pull production code down for local debugging or experimentation. This also means complex workloads run end-to-end without human process or magic CLI incantations, so they can be automated on day 0.
  • Rich, Unified Scheduling: Scheduling must be rich and stable, but also must be applied evenly across all workloads. There should not be multiple scheduling pools/regimes for training, inference, batch, etc., which require extra headroom each to not bump into each other.
  • Debugging / Fault-tolerance: Every debugging channel that compliance allows should be available across any workload: propagating logs and exceptions, SSH into any node, PDB debugging inside live processes, profiling, utilization monitoring, etc. Errors should dependably propagate for action or recourse in code, so fault-tolerance can be applied surgically as a user concern, rather than coarsely through platform-level actions like retries.
  • Production-Readiness: In addition to the reproducibility, scaling, and fault-tolerance points above, the system must have Kubernetes’s rich tooling ecosystems and configurability for infrastructure concerns like observability, networking, security, optimization, etc.

Our approach - Kubetorch

Without turning this post into marketing, we’ve built Kubetorch, a Kubernetes-native platform which is carefully and ambitiously designed to achieve all of the above. It allows your team to execute Python workloads on cloud compute of any shape or scale, with extremely fast iteration - code changes re-execute in about 1-2 seconds. Please reach out to hello@run.house if you’re interested in working with us.

Gaps that Remain

There remains a major gap in the Kubernetes ecosystem around scheduling. Between Kueue, Volcano, YuniKorn, and others, there does not appear to be a scheduling and prioritization standard that is as rich, stable, and customizable as Slurm. In discussions with our partners, we’ve heard strong negative sentiment about how well these tools support scheduling “out of the box” without manual setup work. GPU scheduling is a further complication, where efforts to open source AI-focused schedulers for Kubernetes are still early.

We hope to see and contribute to more progress in this area. If you have opinions or would like to give us feedback here, let us know!
