
The Best Way to Use Kubeflow Trainer Is with Kubetorch

By using Kubetorch with Kubeflow Trainer, you get the same benefits you already experience with TrainJobs, plus interactive iteration speed on Kubernetes, support for reinforcement learning, and next-level fault tolerance and programmatic control.

Paul Yang

ML @ 🏃‍♀️Runhouse🏠

December 17, 2025

Kubetorch is the best way to launch and iterate on workloads based on Kubeflow Trainer resources (both v2 TrainJobs and v1 PyTorchJobs) on Kubernetes. We assume some familiarity with Kubeflow Trainer; briefly, it offers CRDs with a rich feature set for training on Kubernetes, including distributed execution, platform configurability, integrations with scheduling tools like Kueue and Volcano, and gang scheduling.

Kubetorch exposes these CRDs as optional resource flavors when it deploys your regular classes and functions to Kubernetes as long-lived “actors.” You can make repeated, ordinary calls against these Kubernetes-based actors to execute work like `train_epoch()` as if they were local, and update the actors without tearing down the underlying resources.
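As a minimal sketch of what that looks like (the `Trainer` class and its `train_epoch()` method are hypothetical stand-ins for your own code; the `kt.Compute`, `kt.Image`, and `kt.cls(...).to(...)` calls follow the code sample at the end of this post):

import kubetorch as kt

from my_project.trainer import Trainer  # hypothetical: your own training class

# Request distributed GPU compute with a PyTorch flavor, held warm as a long-lived actor
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime"),
    inactivity_ttl="1h",
).distribute("pytorch", workers=2)

# Deploy the class to Kubernetes and call its methods as if it were local
trainer = kt.cls(Trainer).to(gpus)
for epoch in range(3):
    print(trainer.train_epoch())  # repeated calls against the same warm actor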

Specifically, by using Kubetorch to launch the Trainer jobs, you will:

  1. Experience fast on-cluster interactive iteration speed (<2 seconds to redeploy code changes) for distributed programs, dramatically improving the development experience.
  2. Enable repeated interaction with a long-lived, stateful training actor, which not only lets you run inference and evaluations directly on the same resources, but also makes reinforcement learning possible with Kubeflow Trainer primitives.
  3. Unlock fault-tolerant training and powerful programmatic control over execution.

Iteration Speed

Kubetorch makes several parts of the iteration loop effectively costless, greatly improving development and ML research.

1. Propagating Changes: When using vanilla Kubeflow, the two most popular ways to deploy updates are via Docker and via Git. Git provides a simpler path: push to a branch, pull the module from that branch, and the code is synchronized. However, it relies on a single Docker image working across every iteration and every project, which assumes no system packages need to be installed via apt-get and no bash commands need to run. Additionally, any heavy installs must be rerun constantly, adding potentially significant latency. By contrast, rebuilding Docker images with code updates delivers significant flexibility, but at the cost of introducing 15-30 minute loops to build, re-push, and re-pull the image. Teams more commonly rebuild Docker images despite the time cost, simply because a pip-only route is too limiting, but this significantly hurts developer satisfaction on the "deploy-to-test" path.

With Kubetorch, syncing of local changes is automatic. All of the work started by Kubetorch is long-lived, and so the pods stay warm and ready for redeployment. Any local changes propagate within 2 seconds to remote, and training can restart almost immediately with no overhead.
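Concretely, re-running the dispatch after a local edit reuses the warm pods and re-syncs only the changed code (a sketch; `train` is your own function and the compute definition mirrors the example later in this post):

import kubetorch as kt

from train import train  # hypothetical: your local training function

gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime"),
    inactivity_ttl="1h",
).distribute("pytorch", workers=2)

# The first run launches pods; after a local code edit, re-running this same
# dispatch reuses the warm pods and propagates the change in roughly two seconds.
train_ddp = kt.fn(train).to(gpus)
train_ddp(epochs=1, batch_size=32)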

2. Environment Persistence: Many tools let you interactively update the environment you are working in, such as pip installing a library instead of fully rebuilding your underlying Docker image. However, when your job restarts, those interactive changes fall away. Installing Torch can take a few minutes, and building Flash Attention takes even longer, so repeating these installs on every restart adds significant latency to each iteration.

With Kubetorch, you can chain arbitrary mutations to your base image and have those persist until you want to restart your workload fully. Bash commands, pip installations, etc., are all maintained across multiple iterations of your underlying code.
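For example (a sketch: `pip_install` and `set_env_vars` appear in the code sample at the end of this post, while `run_bash` is a hypothetical method name for the bash-command chaining described above):

import os

import kubetorch as kt

# Chain mutations onto the base image; they persist across code iterations
# until the workload is fully torn down.
img = (
    kt.Image(image_id="pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime")
    .run_bash("apt-get update && apt-get install -y ffmpeg")  # hypothetical method name
    .pip_install(["flash-attn", "datasets"])
    .set_env_vars({"HF_TOKEN": os.environ["HF_TOKEN"]})
)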

3. Data and Artifact Caching: A closely related topic to environment persistence is the persistence of data and model artifacts. Redownloading 50GB of model checkpoints easily takes a few minutes. Training batches can be streamed from blob storage, but caching preprocessed data on training pods for reuse can improve throughput and reduce egress costs where relevant. Some teams mount volumes to every pod and enable artifact persistence on a shared filesystem; this has proven undesirable in practice, as it introduces state with a murky lifecycle, can drive high costs, and has worse I/O performance that can topple at scale.

With Kubetorch, the launched pods are kept alive until explicitly torn down, and so any artifact, data loading, and pre-processing can be cached on pod-local ephemeral storage and reused with zero overhead across iterations.
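For example, the trainer class you deploy can cache datasets or checkpoints on pod-local disk and reuse them across calls (a sketch; the class and helper methods are illustrative):

import os

class Trainer:  # illustrative class deployed as an actor via kt.cls(...)
    def load_data(self, batch_size):
        cache_dir = "/tmp/vhr10"  # pod-local ephemeral storage, reused across calls
        if not os.path.exists(cache_dir):
            self._download_dataset(cache_dir)  # hypothetical helper: runs once per pod lifetime
        self.loader = self._build_loader(cache_dir, batch_size)  # hypothetical helper

    def load_checkpoint(self, path="/tmp/checkpoints/latest.pt"):
        # A large checkpoint downloaded once stays on the pod until teardown,
        # so subsequent iterations skip the multi-minute re-download.
        if not os.path.exists(path):
            self._download_checkpoint(path)  # hypothetical helper
        return path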

4. Queuing and Scheduling: In busy cluster environments operating at saturation (the ideal case with GPUs), each iteration loop requires requeuing for capacity. A common secondary issue is that work lands on a node without the requisite Docker image, which adds further latency. Practically, we find that teams must over-allocate pools of compute for development to avoid starving developer progress. The model of research iteration we want to target is similar to the one Slurm provides (srun bash / salloc): the scheduling system should prioritize and return compute to researchers and engineers to ensure continuous, rapid iteration while they are productive. This is the experience when workloads are managed with Kubetorch: compute is held in place until torn down by the researcher, killed by an inactivity TTL, or preempted through platform scheduling systems.
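The relevant knobs appear in the code sample at the end of this post; roughly (the explicit teardown call shown here is an assumption about the API):

import kubetorch as kt

# Compute is held warm until explicitly torn down, reclaimed by the inactivity TTL,
# or preempted by platform scheduling.
gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime"),
    launch_timeout=600,   # how long to wait for capacity when the cluster is saturated
    inactivity_ttl="1h",  # return the compute automatically after an hour of inactivity
).distribute("pytorch", workers=2)

# ...iterate against the warm compute for as long as you are productive...

gpus.teardown()  # hypothetical explicit teardown call, per the "explicitly torn down" behavior above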

Actor Model Unlocks Inference, Evaluations, and RL

With Kubetorch, you do not have to deploy a monolithic training entrypoint whose lifecycle ends when training ends. Instead, you can deploy a (distributed) class with arbitrary methods on it, make calls against it asynchronously and from multiple threads, and interact with that deployed class until you are satisfied and tear it down. This is why you can describe Kubetorch as stateful actor deployments on Kubernetes, with an optional TrainJob flavor for the actor.

This means that your actor can now support many post-training evaluations, including manual testing, without further redeployment. It’s very common to checkpoint your Kubeflow Trainer job and then load that checkpoint to play with inference on the new model. With Kubetorch, you can instantly start making calls against the new model, or share the service with a teammate so they can start playing with it too. During training, you can also, as a convenience, make multi-threaded calls into the training actor to interrogate training state, or make asynchronous calls to checkpoint outside of the main training loop.
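For instance (a sketch: `remote_trainer` is the actor from the code sample at the end of this post, `predict()` is listed among its methods, and `get_metrics()` and the sample input are hypothetical):

import threading

# Kick off a long-running training call on a background thread...
t = threading.Thread(
    target=remote_trainer.train, kwargs={"num_epochs": 10, "threshold": 0.3}
)
t.start()

# ...then query training state or run inference against the same warm actor.
print(remote_trainer.get_metrics())        # hypothetical method: current loss/step
print(remote_trainer.predict(sample_img))  # sample_img: a hypothetical input image
t.join()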

Where this becomes uniquely valuable is for teams starting to do reinforcement learning. Outside of Ray, no real actor framework exists for ML on Kubernetes, and adopting Ray means layering another distributed framework on top of your Kubernetes stack. But how would you do multiple turns of inference, evaluation, and training with Kubeflow TrainJobs without Kubetorch? With this actor model, a TrainJob becomes a single component: training can be called again and again from a controller, and the distributed TrainJob can sync weights pod-to-pod to the inference servers, as sketched below.
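A controller loop might look roughly like this (all class and method names here are illustrative assumptions; only the `kt.cls(...).to(...)` dispatch pattern comes from the code sample below):

import kubetorch as kt

# Two actors on separate compute: a TrainJob-flavored trainer and an inference server.
trainer = kt.cls(PolicyTrainer).to(train_compute)      # illustrative class
inference = kt.cls(InferenceServer).to(infer_compute)  # illustrative class

for step in range(num_rl_steps):
    rollouts = inference.generate(prompts)   # collect rollouts from the current policy
    rewards = evaluate(rollouts)             # score them locally or on another actor
    trainer.train_step(rollouts, rewards)    # call training repeatedly on the same actor
    trainer.sync_weights_to(inference)       # hypothetical pod-to-pod weight sync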

Powerful Programmatic Control and Fault Tolerance

Building on the points about actors: with Kubetorch, you have a driver-plus-actor model (your local Python process plus the distributed TrainJob-based pods on Kubernetes). When you launch training jobs with Kubetorch, you gain the ability to make decisions about execution flow from outside of the actual execution. This is in direct contrast with batch job submission-style execution, where, after the training job is sent away, all logic and fault tolerance must live in-workload, and any failure risks cascading to destroy the entire job. When a call ends, either because of an error on the remote side or because the method finishes, control and any return values pass back to the driver.

The most obvious benefit is fault tolerance. Faults in the actor do not propagate back to destroy the driver program; instead, control returns to the driver program to make further calls. So if you hit a CUDA OOM because of a large batch size, you can trivially retry the same training call with a smaller batch size. Or if the world size changes because pods or nodes were preempted, the driver process can instruct the training job to restart without losing any of the data checkpointed to pod-local disk or even held in GPU memory. You also get programmatic control in this format: you can change parameters and hyperparameters with an awareness of global state, or manually intervene, tweak a hyperparameter, and continue within seconds.
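For example, driver-side handling of a remote CUDA OOM could look like this (a sketch reusing `remote_trainer` and `args` from the code sample below; how Kubetorch surfaces remote exceptions is an assumption, so a broad except is shown):

batch_size = args.batch_size
while batch_size >= 1:
    try:
        remote_trainer.load_data(batch_size, num_workers=0)
        remote_trainer.train(num_epochs=args.epochs, threshold=args.threshold)
        break  # training finished; control and return values came back to the driver
    except Exception as e:  # e.g., a CUDA OOM raised on the remote side
        if "out of memory" in str(e).lower() and batch_size > 1:
            batch_size //= 2  # retry the same call with a smaller batch size
            print(f"OOM on remote; retrying with batch_size={batch_size}")
        else:
            raise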

Relationship with Pipelines (or Research to Production)

Before we conclude, it is worth briefly touching on how Kubetorch relates to Kubeflow Pipelines as well. Philosophically, we believe that your orchestrator should be a system that schedules execution, retries failures, and alerts you to errors, but does not impose any additional development overhead or require changing your research code at all. That’s why Kubetorch code drops cleanly into Kubeflow Pipelines, with the step being either Python or a bash command for executing Python.

Specifically, remember that Kubetorch APIs bring up compute and “dispatch” work to run on that compute. So these same few lines of code will run anywhere:

gpus = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(["datasets"]),
    launch_timeout=600,
    inactivity_ttl="1h",
).distribute("pytorch", workers=2)

train_ddp = kt.fn(train).to(gpus)
results = train_ddp(epochs=args.epochs, batch_size=args.batch_size)

  • During development, run it from your local machine, and Kubetorch does the fast redeploy to the remote service.
  • In CI, the CI runner runs the Kubetorch `.to()` and launches the workload on Kubernetes.
  • For production runs in Pipelines, use Kubetorch to launch compute and make identical calls, with no code changes required (see the sketch after this list).
    • Optionally, if you have built a Docker image of your team’s repo, instead of dispatching with `.to()`, you can specify that production jobs should import the code from the local Docker environment rather than syncing it up from the calling driver environment.
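
As an illustration, a Kubeflow Pipelines step can simply run the same driver script (a sketch using the KFP v2 SDK; the image and script path are placeholders):

from kfp import dsl

@dsl.container_component
def train_step():
    # The step container just runs the driver script; Kubetorch launches the actual
    # training compute on the cluster, so this container needs no GPUs itself.
    return dsl.ContainerSpec(
        image="python:3.11",  # placeholder: an image with kubetorch and your repo installed
        command=["python", "train_driver.py"],  # placeholder path to the driver script
    )

@dsl.pipeline(name="train-pipeline")
def pipeline():
    train_step()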

Code Sample

import argparse
import os
import time

import kubetorch as kt

# We import the underlying trainer class, which is a regular Python class
# with methods like train(), load_data(), predict(), etc.
from vhr10_dinov3_classifier import VHR10Trainer

# A toy example of a TrainJob manifest, hardcoded for convenience
container = {
    "name": "kubetorch",
    "image": "pytorch/pytorch:latest",
    "resources": {
        "requests": {"nvidia.com/gpu": 1},
        "limits": {"nvidia.com/gpu": 1},
    },
}

TRAINJOB_MANIFEST = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {
        "name": "",
        "namespace": "default",
    },
    "spec": {
        "runtimeRef": {"name": "torch-distributed"},
        "trainer": {"numNodes": 3},
        "template": {"spec": {"containers": [container]}},
    },
}


def main():
    start_time = time.time()

    parser = argparse.ArgumentParser(description="VHR10 Classification with DINOv3")
    parser.add_argument("--epochs", type=int, default=3, help="number of epochs")
    parser.add_argument("--batch-size", type=int, default=32, help="batch size")
    parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
    parser.add_argument(
        "--data-root", type=str, default="./data", help="data root directory"
    )
    parser.add_argument(
        "--checkpoint-dir",
        type=str,
        default="./checkpoints",
        help="checkpoint directory",
    )
    parser.add_argument(
        "--model-name",
        type=str,
        default="vitl16",
        choices=["vitl16", "vit7b16"],
        help="DINOv3 model variant (vitl16: distilled, vit7b16: original)",
    )
    parser.add_argument(
        "--freeze-backbone",
        action="store_true",
        default=True,
        help="freeze backbone weights",
    )
    parser.add_argument(
        "--workers", type=int, default=3, help="number of distributed workers"
    )
    parser.add_argument(
        "--num-classes",
        type=int,
        default=10,
        help="number of classes (VHR10 uses labels 1-10)",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.3,
        help="probability to use for binary classification",
    )
    args = parser.parse_args()

    # Override worker replicas based on passed args
    TRAINJOB_MANIFEST["spec"]["trainer"]["numNodes"] = args.workers

    # Define image with dependencies, which overrides the base container
    img = (
        kt.Image(image_id="pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime")
        .pip_install(
            [
                "torchgeo[datasets,models]",
                "transformers",
                "torchvision",
                "pillow",
                "soxr",
            ]
        )
        .set_env_vars({"HF_TOKEN": os.environ["HF_TOKEN"]})
    )

    # Create compute from the TrainJob manifest, and then update with fields for Kubetorch
    gpu_compute = kt.Compute.from_manifest(TRAINJOB_MANIFEST)
    gpu_compute.gpus = 1
    gpu_compute.image = img
    gpu_compute.launch_timeout = 600
    gpu_compute.inactivity_ttl = "2h"
    gpu_compute.distributed_config = {"quorum_workers": args.workers}

    # Dispatch trainer class to remote compute, launching as TrainJob/PTJ
    init_args = dict(
        data_root=args.data_root,
        checkpoint_dir=args.checkpoint_dir,
    )
    remote_trainer = kt.cls(VHR10Trainer).to(gpu_compute, init_args=init_args)
    print("Time to first activity:", time.time() - start_time)

    # Run distributed training, calling methods on remote
    remote_trainer.setup(
        lr=args.lr,
        freeze_backbone=args.freeze_backbone,
        model_name=args.model_name,
        num_classes=args.num_classes,
    )
    print("Time to setup:", time.time() - start_time)

    remote_trainer.load_data(args.batch_size, num_workers=0)
    print(
        "Time to start training:", time.time() - start_time
    )  # 50 seconds from a warm node, 19 seconds for subsequent iterations to Trainer code

    remote_trainer.train(num_epochs=args.epochs, threshold=args.threshold)
    print("Training complete, total time:", time.time() - start_time)


if __name__ == "__main__":
    main()
