Autoscaling and Distribution
Autoscaling
Kubetorch supports automatic scaling of your services based on concurrency or request rate. Critically, this also controls auto-scaling down to zero; you may want to keep a minimum scale up or set a generous scale_to_zero_grace_period to ensure interactive development is not disrupted and your service stays warm for hot reloads.
- target (int): The threshold of concurrent requests or requests per second that triggers scaling. Default: 100
- window (str): Time window to average metrics over (e.g., "30s", "1m"). Default: "60s"
- metric ("concurrency" or "rps"): Metric to base scaling on. "concurrency" counts simultaneous in-flight requests; "rps" counts requests per second. Default: "concurrency"
- target_utilization (int): The percentage of the target at which scaling should begin. E.g., if target=100 and target_utilization=70, scaling starts at 70 active requests. Default: 70
- min_scale (int, optional): Minimum number of replicas to keep alive. Set to 0 to allow scale-to-zero. Default: None (scale-to-zero disabled)
- max_scale (int): Maximum number of replicas to scale to. Default: 0 (no limit)
- initial_scale (int): Number of pods to start with initially before autoscaling takes effect. Default: 1
- scale_to_zero_grace_period (int, optional): Time (in seconds) to wait before scaling to zero after inactivity. Setting this enables scale-to-zero and overrides min_scale. Example: 300 means scale to zero after 5 minutes idle.
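To make these parameters concrete, here is a minimal sketch of an autoscaling configuration. The autoscale() attachment point below is an assumption for illustration (this page only shows scale_to_zero_grace_period passed through distribute() in the example at the bottom); check the API reference for the exact call signature.

```python
import kubetorch as kt

# Hypothetical sketch: the autoscale() method is assumed for illustration;
# the parameter names and defaults match the list above.
compute = kt.Compute(cpus=2).autoscale(
    target=100,             # scale when ~100 concurrent requests per replica
    window="60s",           # average the metric over the last 60 seconds
    metric="concurrency",   # or "rps" to scale on requests per second
    target_utilization=70,  # begin scaling at 70 in-flight requests (70% of target)
    max_scale=8,            # never run more than 8 replicas
    initial_scale=1,        # start with a single pod
    scale_to_zero_grace_period=300,  # enables scale-to-zero after 5 idle minutes, overriding min_scale
)
```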
Distribution
Building on top of scaling, Kubetorch provides a number of options for using distributed frameworks. Usage is simple:
- distribution_type (str): The type of distributed supervisor to create. Options include 'ray', 'pytorch', etc. For PyTorch, for instance, we will set up the environment variables required for distributed training for you.
- workers (int, optional): The number of workers to create.
Some manual configuration may be needed on your cluster to use some of the distribution methods (e.g. Spark). Let us know if you run into any issues.
Example
Putting it all together:
```python
import os

import kubetorch as kt

gpus = kt.Compute(
    gpus=1,
    cpus=3,
    memory=12,
    inactivity_ttl="24h",
    launch_timeout=60,
    freeze=(os.environ.get("ENVIRONMENT", "DEV") == "PROD"),
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3").pip_install(["vllm"]),
).distribute("pytorch", workers=4, scale_to_zero_grace_period=60)
```
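Once the compute is defined, a function can be dispatched onto it. The snippet below is an illustrative sketch, not part of the example above: the train function is an assumption, and each of the 4 PyTorch workers receives the torch.distributed environment variables that Kubetorch sets up for the 'pytorch' distribution type.

```python
def train():
    # Hypothetical training entrypoint; RANK, WORLD_SIZE, MASTER_ADDR, etc.
    # are pre-set by Kubetorch for the "pytorch" distribution type.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    print(f"hello from rank {dist.get_rank()} of {dist.get_world_size()}")

remote_train = kt.fn(train).to(gpus)  # `gpus` is the Compute defined above
remote_train()
```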