Autoscaling and Distribution

Autoscaling

Kubetorch supports automatic scaling of your services based on concurrency or request rate. Critically, this also controls scaling down to zero; you may want to keep a minimum number of replicas alive or set a generous scale_to_zero_grace_period so that interactive development is not disrupted and your service stays warm for hot reloads. A configuration sketch follows the parameter list below.

  • target (int):
    The threshold of concurrent requests or requests per second that triggers scaling.
    Default: 100

  • window (str):
    Time window to average metrics over (e.g., "30s", "1m").
    Default: "60s"

  • metric ("concurrency" or "rps"):
    Metric to base scaling on:

    • "concurrency": number of simultaneous in-flight requests
    • "rps": requests per second

    Default: "concurrency"

  • target_utilization (int):
    The percentage of the target at which scaling should begin.
    E.g., if target=100 and target_utilization=70, scaling starts at 70 active requests.
    Default: 70

  • min_scale (int, optional):
    Minimum number of replicas to keep alive. Set to 0 to allow scale-to-zero.
    Default: None (scale-to-zero disabled)

  • max_scale (int):
    Maximum number of replicas to scale to.
    Default: 0 (no limit)

  • initial_scale (int):
    Number of pods to start with initially before autoscaling takes effect.
    Default: 1

  • scale_to_zero_grace_period (int, optional):
    Time (in seconds) to wait before scaling to zero after inactivity.
    Setting this enables scale-to-zero and overrides min_scale.
    Example: 300 means scale to zero after 5 minutes idle.
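
The following sketch shows these parameters together. The parameter names are the ones documented above, but the .autoscale(...) call site on kt.Compute is an assumption for illustration; check your deployment API for the exact entry point.

import kubetorch as kt

# Hypothetical wiring of the documented autoscaling options onto a compute.
compute = kt.Compute(cpus=2, memory=4).autoscale(
    metric="concurrency",            # scale on in-flight requests
    target=100,                      # per-replica request threshold
    target_utilization=70,           # begin scaling at 70 active requests
    window="60s",                    # averaging window for the metric
    max_scale=10,                    # cap at 10 replicas
    initial_scale=1,                 # start with one pod
    scale_to_zero_grace_period=300,  # scale to zero after 5 idle minutes
)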

Distribution

Building on top of scaling, Kubetorch provides a number of options for running distributed frameworks. Usage is simple:

  • distribution_type (str): The type of distributed supervisor to create. Options include 'ray', 'pytorch', etc. For PyTorch, for instance, we set up the environment variables required for distributed training for you.
  • workers (int, optional): The number of workers to create.

Some distribution methods (e.g., Spark) may require manual configuration on your cluster. Let us know if you run into any issues.

Example

Here we combine compute configuration with PyTorch distribution:

import os

import kubetorch as kt

gpus = kt.Compute(
    gpus=1,
    cpus=3,
    memory=12,
    inactivity_ttl="24h",
    launch_timeout=60,
    # Freeze in production, based on the ENVIRONMENT variable.
    freeze=(os.environ.get("ENVIRONMENT", "DEV") == "PROD"),
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(["vllm"]),
).distribute("pytorch", workers=4, scale_to_zero_grace_period=60)
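
From here, a function can be dispatched onto the distributed compute. The sketch below assumes Kubetorch's kt.fn(...).to(...) deployment pattern; the train function is a placeholder for your own entrypoint.

import kubetorch as kt

def train():
    # Runs on each of the 4 PyTorch workers; the environment variables
    # required by torch.distributed (RANK, WORLD_SIZE, MASTER_ADDR, ...)
    # are set up by Kubetorch per the "pytorch" distribution type.
    ...

# Deploy onto the compute defined above. After 60 idle seconds the
# service scales back to zero per scale_to_zero_grace_period=60.
remote_train = kt.fn(train).to(gpus)
remote_train()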