Kubernetes Installation

You can install Kubetorch on an existing cluster or create a new cluster with it preinstalled. If you are looking for a managed, serverless Kubetorch experience, please see our Serverless guide.

This guide will help you get to a working setup with Kubetorch using its base default settings. You will:

  • Download the Kubetorch chart and install it on your cluster with Helm
  • Optionally enable features like autoscaling or Ray with additional installs

After that, we recommend running our Hello World to ensure everything is working.

If you have any questions or need help, join our Slack community or reach out to the Runhouse team.

Helm Chart Installation

Kubetorch Helm charts are hosted publicly on GitHub Container Registry (GHCR), so you can pull or install them directly — no authentication or token required.


Install Kubetorch

You can install Kubetorch in several ways:

Option 1: Pull the chart locally

Download and extract the chart:

helm pull oci://ghcr.io/run-house/charts/kubetorch --version <VERSION> --untar

This creates a local directory named kubetorch. Update values.yaml if needed, then install:

helm upgrade --install kubetorch ./kubetorch -n kubetorch --create-namespace

Option 2: Install from OCI

Skip downloading and install directly from OCI:

helm upgrade --install kubetorch oci://ghcr.io/run-house/charts/kubetorch \
  --version <VERSION> -n kubetorch --create-namespace

Option 3: Install with Helmfile

If you prefer Helmfile, define the release in helmfile.yaml:

releases:
  - name: kubetorch
    namespace: kubetorch
    chart: oci://ghcr.io/run-house/charts/kubetorch
    version: <VERSION>
    values:
      - ./values.yaml # Adjust the path as needed

Then sync your releases:

helmfile sync
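
Whichever option you choose, you can confirm the release came up before moving on. A quick sanity check (pod names will vary with your chart version):

```shell
# Confirm the Helm release deployed successfully
helm status kubetorch -n kubetorch

# All pods in the release namespace should reach Running
kubectl get pods -n kubetorch
```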

Install Knative (Optional)

Autoscaling Kubetorch services use Knative. You may skip this step if you are not planning to use autoscaling.

If Knative isn’t already installed, you can add the Operator by running:

helm repo add knative-operator https://knative.github.io/operator
helm repo update
helm install knative-operator --create-namespace --namespace knative-operator \
  knative-operator/knative-operator

Note

If your Kubernetes cluster is on a version earlier than 1.31.0, install a Knative Operator release earlier than 1.18.0 by pinning it with the --version flag.
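
For example, on a 1.30 cluster you might pin the Operator like so (1.17.0 is only an illustrative version; check the Knative releases page for the latest compatible one):

```shell
# Pin the Knative Operator to a pre-1.18 release for older clusters
helm install knative-operator --create-namespace --namespace knative-operator \
  knative-operator/knative-operator --version 1.17.0
```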

Next, we'll create a KnativeServing custom resource that configures and enables Knative Serving in the knative-serving namespace by applying the YAML in the Helm chart:

kubectl create namespace knative-serving
kubectl apply -f ./kubetorch/knative/serving.yaml
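
Before relying on autoscaling, you can check that the Operator has finished reconciling Knative Serving; all pods in the knative-serving namespace should reach Running:

```shell
# Watch Knative Serving come up
kubectl get pods -n knative-serving

# The KnativeServing resource should report Ready=True once reconciled
kubectl get knativeserving -n knative-serving
```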

Install Ray (Optional)

Kubetorch supports Ray out of the box. To enable Ray, install the KubeRay Operator by running the following commands:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.4.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0

For more information on installation and usage, see the KubeRay Operator documentation.
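
To confirm the operator and its CRDs installed correctly (the label selector below assumes the chart's standard Helm labels):

```shell
# The operator installs into the current namespace by default
kubectl get pods -l app.kubernetes.io/name=kuberay-operator

# The Ray CRDs (RayCluster, RayJob, RayService) should now be registered
kubectl get crd | grep ray.io
```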

New Kubernetes Cluster

If you want to create a new Kubernetes cluster with Kubetorch installed, please use the Terraform script provided to you by the Kubetorch team.

This script will:

  • Create a new Kubernetes cluster
  • Install the Kubetorch Helm chart
  • Set up all necessary dependencies (including log streaming)

Additional Configuration

The following sections are optional and generally not necessary for a minimal working setup.

Deployment Namespaces

deployment_namespaces is a list of namespaces, in addition to Kubetorch's release namespace, in which Kubetorch workloads are allowed to be deployed. Listing a namespace here ensures the required RBAC and watchers are created for it.

kubetorchConfig:
  deployment_namespaces:
    - "qa"
    - "training"
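
If you prefer not to edit values.yaml, the same list can be passed at install time with --set, using Helm's bracket syntax for list indices:

```shell
helm upgrade --install kubetorch oci://ghcr.io/run-house/charts/kubetorch \
  --version <VERSION> -n kubetorch --create-namespace \
  --set "kubetorchConfig.deployment_namespaces[0]=qa" \
  --set "kubetorchConfig.deployment_namespaces[1]=training"
```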

Workload Identity / Service Account Annotations

For cloud provider integrations (e.g., accessing cloud storage or secrets), you can add annotations to the Kubetorch service account:

# GKE Workload Identity
kubetorchConfig:
  serviceAccountAnnotations:
    iam.gke.io/gcp-service-account: "my-sa@my-project.iam.gserviceaccount.com"

# AWS IAM Roles for Service Accounts (IRSA)
kubetorchConfig:
  serviceAccountAnnotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/my-role"

DNS Resolver

By default, Kubetorch will use the kube-dns resolver, which is the EKS/GKE default. If your cluster is using a different DNS resolver (like coredns), you can use the resolver field in the kubetorchController.nginx section of the values.yaml file to point to your DNS resolver service:

kubetorchController:
  nginx:
    resolver: "coredns.kube-system.svc.cluster.local"

Or if running with Helm directly:

helm upgrade --install kubetorch oci://ghcr.io/run-house/charts/kubetorch \
  --version <version> -n kubetorch --create-namespace \
  --set kubetorchController.nginx.resolver="coredns.kube-system.svc.cluster.local"
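
If you are unsure which resolver your cluster runs, the kube-system namespace usually makes it clear. Note that the DNS Service is typically named kube-dns even when CoreDNS is serving it, so check both:

```shell
# The cluster DNS Service (conventionally labeled k8s-app=kube-dns)
kubectl get svc -n kube-system -l k8s-app=kube-dns

# If this Deployment exists, CoreDNS is the resolver behind that Service
kubectl get deploy -n kube-system coredns
```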

Taints & Tolerations

If your cluster uses taints or requires workloads to run on specific node groups, Kubetorch allows you to configure tolerations and node affinity for the controller and data store components. You can set these fields directly in values.yaml:

kubetorchController:
  tolerations:
    - operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/nodegroup
                operator: Exists

dataStore:
  tolerations:
    - operator: Exists
  affinity: {}

Or pass them during installation using --set:

helm upgrade --install kubetorch oci://ghcr.io/run-house/charts/kubetorch \
  --version <version> \
  -n kubetorch \
  --create-namespace \
  --set "kubetorchController.tolerations[0].operator=Exists" \
  --set "dataStore.tolerations[0].operator=Exists"

For more complex scheduling rules (multiple node groups, weighted affinities, custom taints, etc.), we recommend downloading the chart and editing the values.yaml directly.

Data Store Storage

For clusters with large data transfer needs, you may want to adjust the data store storage and connection limits:

dataStore:
  storage:
    size: 200Gi # Default is 100Gi
    storageClassName: "fast-ssd" # Use a specific storage class
  maxConnections: 1000 # Default is 500, increase for many concurrent workers
  timeout: 600 # Connection timeout in seconds

Log Retention

Adjust log retention based on your debugging and compliance needs:

logStreaming:
  retentionPeriod: 72h # Default is 24h

Inactivity TTL

The TTL controller automatically tears down idle workloads to prevent resources from lingering and accumulating costs. It runs as a background task in the Kubetorch controller, periodically checking for inactive workloads and cleaning them up.

The TTL values configured here serve as cluster-wide defaults. Users can override these defaults on a per-workload basis when deploying their services.

kubetorchController:
  ttl:
    enabled: true # Enabled by default
    intervalSeconds: 300 # How often to check for idle workloads (default: 5 minutes)
    prometheusUrl: "http://kubetorch-metrics.kubetorch.svc.cluster.local:9090"

If you're using kube-prometheus-stack and want to use your existing Prometheus instance instead of the bundled one, you can enable the PodMonitor:

kubetorchController:
  ttl:
    enabled: true
    prometheusUrl: "http://prometheus-operated.monitoring.svc.cluster.local:9090" # Your Prometheus URL
  podMonitor:
    enabled: true
    prometheusLabel: "kube-prometheus" # Label to match your Prometheus Operator instance
    additionalLabels: {}

Note

The TTL controller requires metrics.enabled: true to function. Disabling metrics will also disable automatic workload cleanup.
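
To sanity-check that the TTL controller can reach its metrics backend, you can port-forward the Service from your configured prometheusUrl and hit Prometheus's health endpoint (the service name below assumes the bundled kubetorch-metrics default):

```shell
# Forward the metrics Service locally, then query Prometheus's health check
kubectl -n kubetorch port-forward svc/kubetorch-metrics 9090:9090 &
curl http://localhost:9090/-/healthy
```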

Feature Toggles

You can enable or disable specific Kubetorch features based on your needs:

# Disable GPU support if your cluster has no GPUs
nvidia-device-plugin:
  enabled: false

# Disable log streaming (not recommended for debugging)
logStreaming:
  enabled: false

# Disable metrics collection (disables TTL controller and autoscaling)
metrics:
  enabled: false

# Enable DCGM exporter for GPU metrics on EKS (GKE has this built-in)
dcgm-exporter:
  enabled: true
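
These toggles can also be flipped at install time without editing values.yaml, for example when targeting a CPU-only cluster:

```shell
helm upgrade --install kubetorch oci://ghcr.io/run-house/charts/kubetorch \
  --version <VERSION> -n kubetorch --create-namespace \
  --set nvidia-device-plugin.enabled=false
```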