Load Balancing and Autoscaling

Kubetorch services can automatically scale based on demand. This guide covers configuring load-balanced services: requests are distributed round-robin across a pool of pods, and the pool grows and shrinks automatically with load.

When to Use Autoscaling

Use .autoscale() when you want:

  • Parallel request distribution: Each call is routed to one pod in the pool
  • Automatic scaling: Pods added/removed based on load
  • Scale-to-zero: No pods running when idle (cost optimization)

This is ideal for inference services, APIs, and any workload where each request is independent.

Note: Autoscaling and distributed computing are mutually exclusive. Use .autoscale() for load-balanced services or .distribute() for parallel execution across all pods simultaneously.

Basic Configuration

import kubetorch as kt

compute = kt.Compute(
    cpus="2",
    gpus="1",
    memory="8Gi",
).autoscale(
    min_scale=1,   # Minimum replicas (0 allows scale-to-zero)
    max_scale=10,  # Maximum replicas
    target=10,     # Target concurrent requests per pod
)

remote_fn = kt.fn(inference).to(compute)

Autoscaler Options

Kubetorch supports two autoscaler implementations via Knative:

Autoscaler      Metrics               Use Case
KPA (default)   Concurrency, RPS      Request-based scaling (most ML workloads)
HPA             CPU, Memory, Custom   Resource-based scaling

Knative Pod Autoscaler (KPA)

KPA is the default and the recommended choice for most ML workloads. It scales based on request metrics (concurrent requests or requests per second).

Concurrency-Based Scaling

Scale based on concurrent requests per pod:

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    metric="concurrency",   # Scale on concurrent requests (default)
    target=10,              # Target 10 concurrent requests per pod
    target_utilization=70,  # Scale up at 70% of target (7 concurrent)
    min_scale=1,
    max_scale=20,
)

How it works:

  • If pods average 7+ concurrent requests (70% of target=10), scale up
  • If pods average below target utilization, scale down
  • Lower target = more aggressive scaling (more pods)
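
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (plain Python, not Kubetorch or Knative code, ignoring panic mode and scale-rate limits) of how a KPA-style calculation turns observed concurrency into a replica count:

import math

def desired_replicas(total_concurrent_requests, target=10, target_utilization=70,
                     min_scale=1, max_scale=20):
    # Effective per-pod target: 70% of 10 = 7 concurrent requests
    effective_target = target * target_utilization / 100
    desired = math.ceil(total_concurrent_requests / effective_target)
    # Clamp to the configured bounds
    return max(min_scale, min(max_scale, desired))

desired_replicas(100)  # ceil(100 / 7) = 15 pods
desired_replicas(3)    # stays at min_scale = 1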

RPS-Based Scaling

Scale based on requests per second:

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    metric="rps",           # Scale on requests per second
    target=100,             # Target 100 RPS per pod
    target_utilization=80,  # Scale up at 80 RPS
    min_scale=2,
    max_scale=50,
)
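
The same arithmetic as in the concurrency sketch above applies, just in requests per second: with target=100 and target_utilization=80, each pod's scale-up threshold is roughly 80 RPS, so a sustained 1,000 RPS would push the service toward about ceil(1000 / 80) = 13 pods (clamped between min_scale=2 and max_scale=50).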

Concurrency Limiting

Limit how many concurrent requests each pod handles:

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    concurrency=5,  # Max 5 concurrent requests per pod
    target=5,       # Target same as limit
    min_scale=1,
    max_scale=10,
)

When concurrency is set, additional requests queue until a slot is available. This prevents overloading pods with heavy workloads (e.g., large model inference).
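
As a client-side illustration (a sketch, not a Kubetorch-specific API: remote_fn stands in for a deployed function like the one in Basic Configuration, and batches is placeholder input), firing more calls than the per-pod limit simply queues the excess instead of overloading any pod:

from concurrent.futures import ThreadPoolExecutor

# Up to 20 simultaneous calls against a service capped at concurrency=5:
# each pod serves at most 5 at a time, the rest wait for a free slot,
# and the autoscaler adds pods if the backlog persists.
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(remote_fn, batch) for batch in batches]
    results = [f.result() for f in futures]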

Horizontal Pod Autoscaler (HPA)

HPA scales based on resource utilization. Use it when request metrics don't track load well, for example when a handful of requests can saturate CPU or memory.

CPU-Based Scaling

compute = kt.Compute(cpus="4", memory="8Gi").autoscale(
    autoscaler_class="hpa.autoscaling.knative.dev",
    metric="cpu",
    target=80,  # Scale up at 80% CPU utilization
    min_scale=2,
    max_scale=20,
)

Memory-Based Scaling

compute = kt.Compute(cpus="2", memory="16Gi").autoscale(
    autoscaler_class="hpa.autoscaling.knative.dev",
    metric="memory",
    target=70,  # Scale up at 70% memory utilization
    min_scale=1,
    max_scale=10,
)

Scale-to-Zero Configuration

Scale-to-zero reduces costs when services are idle, but adds cold-start latency.

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    min_scale=0,                               # Allow scale to zero
    scale_to_zero_pod_retention_period="5m",   # Keep last pod for 5 minutes
    scale_down_delay="2m",                     # Wait 2 minutes before scaling down
)

Parameter                            Description                                     Default
min_scale                            Set to 0 to enable scale-to-zero                -
scale_to_zero_pod_retention_period   Time to keep last pod before scaling to zero    10m
scale_down_delay                     Delay before any scale-down action              1m

For interactive development, keep min_scale=1 or set generous retention periods to avoid cold starts disrupting your workflow.
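
For example, a development-friendly variant (a sketch using only parameters shown in this guide) might look like:

# Keep one pod warm during interactive development so cold starts never interrupt you
dev_compute = kt.Compute(cpus="2", gpus="1").autoscale(
    min_scale=1,             # Never scale below one pod
    max_scale=5,
    scale_down_delay="10m",  # Slow scale-down between iterations
)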

Scaling Timing

Control how quickly the autoscaler responds:

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    window="60s",           # Time window for averaging metrics
    scale_down_delay="5m",  # Wait 5 min before scaling down
    initial_scale=3,        # Start with 3 replicas
    min_scale=1,
    max_scale=20,
)

Parameter           Description                               Default
window              Time window for metric averaging          60s
scale_down_delay    Delay before scaling down (KPA only)      1m
initial_scale       Initial replica count on first deploy     min_scale
progress_deadline   Max time for deployment to become ready   10m

Tip: For ML workloads with slow initialization (model loading), set progress_deadline higher than your expected startup time.
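
For instance, if pods pull large model weights at startup and typically take around 10 minutes to become ready, a configuration along these lines (a sketch using only parameters from this guide) gives them headroom:

# Large-model service whose pods need ~10 minutes to load weights at startup
compute = kt.Compute(cpus="4", gpus="1", memory="16Gi").autoscale(
    min_scale=1,
    max_scale=10,
    progress_deadline="20m",  # Comfortably above the expected ~10 minute startup
)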

Complete Configuration Reference

compute = kt.Compute(cpus="4", gpus="1", memory="16Gi").autoscale(
    # Scaling bounds
    min_scale=1,
    max_scale=50,
    initial_scale=2,

    # Metric configuration (KPA)
    metric="concurrency",    # "concurrency", "rps", "cpu", "memory"
    target=10,               # Target value per pod
    target_utilization=70,   # % of target to trigger scaling

    # Request handling
    concurrency=20,          # Max concurrent requests per pod

    # Timing
    window="60s",                               # Metric averaging window
    scale_down_delay="2m",                      # Delay before scale-down
    scale_to_zero_pod_retention_period="10m",   # Keep last pod this long
    progress_deadline="15m",                    # Max deployment time

    # Autoscaler selection (for HPA)
    # autoscaler_class="hpa.autoscaling.knative.dev",
)

Example: Inference Service

A typical inference service configuration:

import kubetorch as kt

def predict(inputs):
    # Model loaded once per pod, reused across requests
    model = load_model()  # Cached
    return model(inputs)

compute = kt.Compute(
    cpus="4",
    gpus="1",
    memory="16Gi",
    image=kt.Image("pytorch/pytorch:2.0.0").pip_install(["transformers"]),
).autoscale(
    min_scale=1,            # Keep at least 1 pod warm
    max_scale=10,           # Scale up to 10 pods under load
    target=5,               # Target 5 concurrent requests per pod
    concurrency=10,         # But allow up to 10 (queue the rest)
    scale_down_delay="5m",  # Wait 5 min before scaling down
)

inference_fn = kt.fn(predict).to(compute)

# Each call is routed to one pod
result = inference_fn(my_inputs)

Example: Batch Processing with Scale-to-Zero

For batch jobs that run periodically:

import kubetorch as kt

compute = kt.Compute(cpus="8", memory="32Gi").autoscale(
    min_scale=0,   # Scale to zero when idle
    max_scale=20,  # Scale up for large batches
    metric="rps",
    target=50,     # Target 50 requests/sec per pod
    scale_to_zero_pod_retention_period="2m",  # Quick scale-down
)

process_fn = kt.fn(process_batch).to(compute)

# Process items - pods scale up automatically
for batch in batches:
    results = process_fn(batch)

# After processing, pods scale to zero

Monitoring Autoscaling

View current replica count and scaling events:

# View service status
kt status my-service

# View pod count
kubectl get pods -l serving.knative.dev/service=my-service

# View autoscaler decisions
kubectl describe ksvc my-service

Requirements

Autoscaling requires Knative Serving installed in your cluster. See the installation guide for setup instructions.

Without Knative, services deploy as standard Kubernetes Deployments with fixed replica counts.