Load Balancing and Autoscaling
Kubetorch services can automatically scale based on demand. This guide covers configuring load-balanced services where requests are distributed round-robin across pods, with automatic scaling up and down.
When to Use Autoscaling
Use .autoscale() when you want:
- Parallel request distribution: Each call is routed to one pod in the pool
- Automatic scaling: Pods added/removed based on load
- Scale-to-zero: No pods running when idle (cost optimization)
This is ideal for inference services, APIs, and any workload where each request is independent.
Note: Autoscaling and distributed computing are mutually exclusive. Use .autoscale() for
load-balanced services or .distribute() for parallel execution across all pods simultaneously.
Basic Configuration
```python
import kubetorch as kt

compute = kt.Compute(
    cpus="2",
    gpus="1",
    memory="8Gi",
).autoscale(
    min_scale=1,   # Minimum replicas (0 allows scale-to-zero)
    max_scale=10,  # Maximum replicas
    target=10,     # Target concurrent requests per pod
)

remote_fn = kt.fn(inference).to(compute)
```
Autoscaler Options
Kubetorch supports two autoscaler implementations via Knative:
| Autoscaler | Metrics | Use Case |
|---|---|---|
| KPA (default) | Concurrency, RPS | Request-based scaling (most ML workloads) |
| HPA | CPU, Memory, Custom | Resource-based scaling |
Knative Pod Autoscaler (KPA)
KPA is the default and recommended for most ML workloads. It scales based on request metrics.
Concurrency-Based Scaling
Scale based on concurrent requests per pod:
```python
compute = kt.Compute(cpus="2", gpus="1").autoscale(
    metric="concurrency",   # Scale on concurrent requests (default)
    target=10,              # Target 10 concurrent requests per pod
    target_utilization=70,  # Scale up at 70% of target (7 concurrent)
    min_scale=1,
    max_scale=20,
)
```
How it works:
- If pods average 7+ concurrent requests (70% of target=10), scale up
- If pods average below target utilization, scale down
- A lower `target` means more aggressive scaling (more pods); see the sketch below for the arithmetic
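The sizing logic is roughly the following (a simplified sketch of the Knative KPA calculation, not a Kubetorch API; the real autoscaler also averages over the metric window and applies a separate panic mode):

```python
import math

def desired_replicas(total_concurrency, target, target_utilization, min_scale, max_scale):
    # Pods needed so each stays at or below target * target_utilization
    # concurrent requests, clamped to the configured bounds.
    effective_target = target * (target_utilization / 100)
    desired = math.ceil(total_concurrency / effective_target)
    return max(min_scale, min(max_scale, desired))

# 100 concurrent requests across the service with target=10 and
# target_utilization=70 -> ceil(100 / 7) = 15 pods
print(desired_replicas(100, target=10, target_utilization=70, min_scale=1, max_scale=20))
```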
RPS-Based Scaling
Scale based on requests per second:
```python
compute = kt.Compute(cpus="2", gpus="1").autoscale(
    metric="rps",           # Scale on requests per second
    target=100,             # Target 100 RPS per pod
    target_utilization=80,  # Scale up at 80 RPS
    min_scale=2,
    max_scale=50,
)
```
Concurrency Limiting
Limit how many concurrent requests each pod handles:
```python
compute = kt.Compute(cpus="2", gpus="1").autoscale(
    concurrency=5,  # Max 5 concurrent requests per pod
    target=5,       # Target same as limit
    min_scale=1,
    max_scale=10,
)
```
When concurrency is set, additional requests queue until a slot is available. This prevents
overloading pods with heavy workloads (e.g., large model inference).
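For example, a client can safely fire more requests than the per-pod limit allows; the excess simply waits for a free slot. A minimal sketch, assuming a remote_fn and an inputs list created as in the basic configuration example (exact queueing behavior depends on your Kubetorch version):

```python
from concurrent.futures import ThreadPoolExecutor

# Fire 20 requests at a service whose pods each accept at most 5 at a time.
# Requests beyond the per-pod concurrency limit queue until a slot frees up,
# so heavy calls (e.g. large-model inference) are never stacked onto one pod.
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(remote_fn, item) for item in inputs]
    results = [f.result() for f in futures]
```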
Horizontal Pod Autoscaler (HPA)
HPA scales based on resource utilization. Use it when request metrics don't correlate with load.
CPU-Based Scaling
```python
compute = kt.Compute(cpus="4", memory="8Gi").autoscale(
    autoscaler_class="hpa.autoscaling.knative.dev",
    metric="cpu",
    target=80,  # Scale up at 80% CPU utilization
    min_scale=2,
    max_scale=20,
)
```
Memory-Based Scaling
```python
compute = kt.Compute(cpus="2", memory="16Gi").autoscale(
    autoscaler_class="hpa.autoscaling.knative.dev",
    metric="memory",
    target=70,  # Scale up at 70% memory utilization
    min_scale=1,
    max_scale=10,
)
```
Scale-to-Zero Configuration
Scale-to-zero reduces costs when services are idle, but adds cold-start latency.
```python
compute = kt.Compute(cpus="2", gpus="1").autoscale(
    min_scale=0,                              # Allow scale to zero
    scale_to_zero_pod_retention_period="5m",  # Keep last pod for 5 minutes
    scale_down_delay="2m",                    # Wait 2 minutes before scaling down
)
```
| Parameter | Description | Default |
|---|---|---|
| `min_scale` | Set to 0 to enable scale-to-zero | - |
| `scale_to_zero_pod_retention_period` | Time to keep the last pod before scaling to zero | 10m |
| `scale_down_delay` | Delay before any scale-down action | 1m |
For interactive development, keep min_scale=1 or set generous retention periods
to avoid cold starts disrupting your workflow.
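One pattern is to pick the scaling floor from your environment, so development deployments stay warm while production can scale to zero. A sketch; the DEV_MODE variable is an illustrative flag, not a Kubetorch setting:

```python
import os
import kubetorch as kt

# Illustrative flag: keep a warm pod while developing, allow scale-to-zero in prod
dev_mode = os.getenv("DEV_MODE", "0") == "1"

compute = kt.Compute(cpus="2", gpus="1").autoscale(
    min_scale=1 if dev_mode else 0,
    scale_to_zero_pod_retention_period="30m" if dev_mode else "5m",
    max_scale=10,
)
```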
Scaling Timing
Control how quickly the autoscaler responds:
```python
compute = kt.Compute(cpus="2", gpus="1").autoscale(
    window="60s",           # Time window for averaging metrics
    scale_down_delay="5m",  # Wait 5 min before scaling down
    initial_scale=3,        # Start with 3 replicas
    min_scale=1,
    max_scale=20,
)
```
| Parameter | Description | Default |
|---|---|---|
| `window` | Time window for metric averaging | 60s |
| `scale_down_delay` | Delay before scaling down (KPA only) | 1m |
| `initial_scale` | Initial replica count on first deploy | min_scale |
| `progress_deadline` | Max time for deployment to become ready | 10m |
Tip: For ML workloads with slow initialization (model loading), set progress_deadline
higher than your expected startup time.
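For example, if weights take several minutes to download and load on a fresh pod, a configuration along these lines (a sketch using only the options documented above) keeps the rollout from being marked as failed during warm-up:

```python
import kubetorch as kt

# Pods may spend ~10 minutes pulling model weights before they are ready,
# so allow up to 20 minutes before the deployment is considered failed.
compute = kt.Compute(cpus="4", gpus="1", memory="32Gi").autoscale(
    min_scale=1,
    max_scale=8,
    progress_deadline="20m",  # comfortably above expected startup time
    window="120s",            # smooth out bursty metrics during warm-up
)
```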
Complete Configuration Reference
```python
compute = kt.Compute(cpus="4", gpus="1", memory="16Gi").autoscale(
    # Scaling bounds
    min_scale=1,
    max_scale=50,
    initial_scale=2,

    # Metric configuration (KPA)
    metric="concurrency",   # "concurrency", "rps", "cpu", "memory"
    target=10,              # Target value per pod
    target_utilization=70,  # % of target to trigger scaling

    # Request handling
    concurrency=20,  # Max concurrent requests per pod

    # Timing
    window="60s",                              # Metric averaging window
    scale_down_delay="2m",                     # Delay before scale-down
    scale_to_zero_pod_retention_period="10m",  # Keep last pod this long
    progress_deadline="15m",                   # Max deployment time

    # Autoscaler selection (for HPA)
    # autoscaler_class="hpa.autoscaling.knative.dev",
)
```
Example: Inference Service
A typical inference service configuration:
```python
import kubetorch as kt

def predict(inputs):
    # Model loaded once per pod, reused across requests
    model = load_model()  # Cached
    return model(inputs)

compute = kt.Compute(
    cpus="4",
    gpus="1",
    memory="16Gi",
    image=kt.Image("pytorch/pytorch:2.0.0").pip_install(["transformers"]),
).autoscale(
    min_scale=1,            # Keep at least 1 pod warm
    max_scale=10,           # Scale up to 10 pods under load
    target=5,               # Target 5 concurrent requests per pod
    concurrency=10,         # But allow up to 10 (queue the rest)
    scale_down_delay="5m",  # Wait 5 min before scaling down
)

inference_fn = kt.fn(predict).to(compute)

# Each call is routed to one pod
result = inference_fn(my_inputs)
```
Example: Batch Processing with Scale-to-Zero
For batch jobs that run periodically:
```python
import kubetorch as kt

compute = kt.Compute(cpus="8", memory="32Gi").autoscale(
    min_scale=0,   # Scale to zero when idle
    max_scale=20,  # Scale up for large batches
    metric="rps",
    target=50,     # Target 50 requests/sec per pod
    scale_to_zero_pod_retention_period="2m",  # Quick scale-down
)

process_fn = kt.fn(process_batch).to(compute)

# Process items - pods scale up automatically
for batch in batches:
    results = process_fn(batch)

# After processing, pods scale to zero
```
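Note that a strictly sequential loop keeps only one request in flight at a time, which may not generate enough load for the autoscaler to add pods. To exercise the full pool, submit batches concurrently, for example with a thread pool (a sketch reusing process_fn and batches from the example above):

```python
from concurrent.futures import ThreadPoolExecutor

# Submit many batches at once so the request rate rises and more pods are added.
with ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(process_fn, batch) for batch in batches]
    results = [f.result() for f in futures]

# Once the queue drains, traffic drops and the service scales back to zero.
```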
Monitoring Autoscaling
View current replica count and scaling events:
```bash
# View service status
kt status my-service

# View pod count
kubectl get pods -l serving.knative.dev/service=my-service

# View autoscaler decisions
kubectl describe ksvc my-service
```
Requirements
Autoscaling requires Knative Serving installed in your cluster. See the installation guide for setup instructions.
Without Knative, services deploy as standard Kubernetes Deployments with fixed replica counts.