Autoscaling
When deploying a service, you can provision multiple instances of your container (a pod in Kubernetes terms) so that incoming traffic is load-balanced across them. Truefoundry refers to each such instance as a replica. By default, a service is deployed with 1 replica. While this is fine for development, a single replica is not the best choice for production workloads for the following reasons:
- If one replica goes down for any reason, the entire service goes down. Running at least 2-3 replicas provides more fault tolerance.
- A single replica might not be able to handle all of the incoming traffic, which causes client-side errors.
To increase the number of replicas, you can either set a fixed higher value or enable autoscaling so that the number of replicas is adjusted automatically based on incoming traffic, CPU usage, or a time schedule.
Setting Fixed Number of Replicas
This can be a good choice if the incoming traffic remains constant and we have a good idea of the number of replicas needed. To set the replicas using the UI, you can follow the steps below:
If you are using the Python SDK to manage deployments, add the following to your deploy.py file:
from truefoundry.deploy import Build, Service, DockerFileBuild, Port
service = Service(
name="my-service",
image=Build(build_spec=DockerFileBuild()),
ports=[
Port(
host="your_host",
port=8501
)
],
+ replicas=3,
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
Enable Autoscaling
When the traffic or resource usage of the service is not constant, the number of replicas should ideally go up and down with the load. In this case, you define the minimum and maximum number of replicas, and the autoscaling strategy decides how many replicas to run within that range. Autoscaling can be enabled based on the following parameters:
- CPU Usage: If the application's CPU usage goes up when traffic increases, this can be a good parameter to autoscale on. For example, say one replica of your application consumes 0.3 vCPU at steady state, and as traffic goes up, the CPU usage rises to a maximum of 1 vCPU. In this case, setting autoscaling to trigger when CPU usage exceeds 0.6 vCPU can be a good idea.
- Requests Per Second (RPS): This is the easiest to reason about and calculate. You can benchmark the service with a tool like Locust to find out how many requests per second one replica can serve. Let's say one replica can serve 10 requests per second without any degradation in quality (increase in latency or errors). If the traffic is expected to vary from 50 to 200 requests per second, you can set the minimum replicas to 5, the maximum to 20, and the RPS autoscaling metric to 10.
- Time Based Autoscaling: This can be useful if traffic patterns shift with timezones, or if you want to shut down dev workloads outside office hours. For example, to scale up replicas only from Monday to Friday, 9 AM to 9 PM, you can set time based scheduling with the cron start schedule 0 9 * * 1-5 and the end schedule 0 21 * * 1-5.
In most cases, RPS turns out to be the most effective and easiest-to-understand metric for autoscaling.
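The RPS sizing arithmetic from the example above can be sketched in a few lines of Python. This is only an illustration of the calculation: the function name replica_bounds and its parameters are hypothetical and are not part of the Truefoundry SDK.

```python
import math


def replica_bounds(min_rps: float, max_rps: float, rps_per_replica: float) -> tuple[int, int]:
    """Return (min_replicas, max_replicas) needed to cover the expected traffic range.

    rps_per_replica is the benchmarked throughput of one replica
    (for example, measured with Locust).
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    # Round up so there is always enough capacity for the given load.
    min_replicas = math.ceil(min_rps / rps_per_replica)
    max_replicas = math.ceil(max_rps / rps_per_replica)
    return min_replicas, max_replicas


# Numbers from the example: 10 RPS per replica, traffic between 50 and 200 RPS.
print(replica_bounds(50, 200, 10))  # (5, 20)
```

In practice you may want some headroom on top of these numbers, since the benchmark figure is the point just before quality starts to degrade.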
Configure Autoscaling for an Application Via UI
Configure Autoscaling for an Application Via Python SDK
In your Service deployment code (deploy.py), include the following for autoscaling on RPS:
from truefoundry.deploy import Build, Service, DockerFileBuild, Port, Autoscaling, RPSMetric
service = Service(
name="my-service",
image=Build(build_spec=DockerFileBuild()),
ports=[
Port(
host="your_host",
port=8501
)
],
+ replicas = Autoscaling(
+ min_replicas=1,
+ max_replicas=3,
+ metrics=RPSMetric(
+ value=2
+ ),
+ )
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
In your Service deployment code (deploy.py), include the following for autoscaling on CPU utilization:
from truefoundry.deploy import Build, Service, DockerFileBuild, Port, Autoscaling, CPUUtilizationMetric
service = Service(
name="my-service",
image=Build(build_spec=DockerFileBuild()),
ports=[
Port(
host="your_host",
port=8501
)
],
+ replicas = Autoscaling(
+ min_replicas=1,
+ max_replicas=3,
+ metrics=CPUUtilizationMetric(
+ value=30
+ ),
+ )
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
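To get a feel for how a CPU utilization target turns into a replica count, here is a rough sketch. It assumes the autoscaler follows the proportional rule used by the standard Kubernetes Horizontal Pod Autoscaler and that the metric value is a percentage; this page does not state either, so treat it as an approximation rather than Truefoundry's exact behaviour. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Estimate the replica count for a utilization target.

    Assumption: desired = ceil(current_replicas * current / target),
    clamped to the [min_replicas, max_replicas] range.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# With the settings above (target 30, min 1, max 3):
print(desired_replicas(1, 75, 30, 1, 3))   # 3: ceil(2.5) = 3
print(desired_replicas(2, 10, 30, 1, 3))   # 1: ceil(0.67) = 1
print(desired_replicas(1, 300, 30, 1, 3))  # 3: 10 is clamped to max_replicas
```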
In your Service deployment code (deploy.py), include the following for time based autoscaling:
from truefoundry.deploy import Build, Service, DockerFileBuild, Port, Autoscaling, CronMetric
service = Service(
name="my-service",
image=Build(build_spec=DockerFileBuild()),
ports=[
Port(
host="your_host",
port=8501
)
],
+ replicas = Autoscaling(
+ min_replicas=1,
+ max_replicas=3,
+ metrics=CronMetric(
+ desired_replicas=3, # Example value: number of replicas to run during the scheduled window
+ start="0 9 * * 1-5", # Check https://crontab.guru/ for cron expressions
+ end="0 21 * * 1-5",
+ timezone="Europe/London" # You can see the list of supported timezones at https://docs.truefoundry.com/docs/list-of-supported-timezones
+ ),
+ )
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
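As a sanity check on the schedule, the window described by the start expression 0 9 * * 1-5 and the end expression 0 21 * * 1-5 can be expressed as a small predicate. This is a minimal sketch that assumes the timestamp is already in the schedule's timezone; the function name in_scale_up_window is hypothetical and the code does not parse cron expressions.

```python
from datetime import datetime


def in_scale_up_window(local_time: datetime) -> bool:
    """True if local_time falls between 09:00 (start) and 21:00 (end), Monday to Friday."""
    # Python's weekday(): Monday is 0 and Friday is 4, matching cron's 1-5 range.
    is_weekday = local_time.weekday() < 5
    return is_weekday and 9 <= local_time.hour < 21


print(in_scale_up_window(datetime(2024, 1, 1, 10, 0)))  # Monday 10:00 -> True
print(in_scale_up_window(datetime(2024, 1, 1, 21, 0)))  # Monday 21:00 -> False
print(in_scale_up_window(datetime(2024, 1, 6, 10, 0)))  # Saturday 10:00 -> False
```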
Advanced configuration
- polling_interval: Time interval in seconds at which the autoscaler checks the metrics. (default: 30)
- cooldown_period: Time in seconds for which the autoscaler waits before scaling to absolute zero replicas. It is only applicable when min_replicas is set to 0. (default: 300)
Auto Shutdown
This feature allows your workloads to automatically scale down to zero when they are idle, which eliminates the cost of running replicas that serve no traffic. You can specify the idle duration after which, if no requests have been received, the replicas are scaled down to zero.
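The idle rule described above can be illustrated with a toy model. This is only a sketch of the idea, not how the feature is implemented: the class IdleTracker, its methods, and the 300-second timeout are hypothetical, and times are plain numbers in seconds.

```python
class IdleTracker:
    """Tracks the last request time and reports when the service counts as idle."""

    def __init__(self, idle_timeout_s: float, started_at: float = 0.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_request_at = started_at

    def record_request(self, now: float) -> None:
        # Every incoming request resets the idle timer.
        self.last_request_at = now

    def should_scale_to_zero(self, now: float) -> bool:
        # Idle means no request has arrived for at least idle_timeout_s seconds.
        return now - self.last_request_at >= self.idle_timeout_s


tracker = IdleTracker(idle_timeout_s=300)
tracker.record_request(now=100)
print(tracker.should_scale_to_zero(now=350))  # False: idle for 250 s
print(tracker.should_scale_to_zero(now=400))  # True: idle for 300 s
```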
Installation
You need to install an infra component called Elasti before using this feature. Go to the cluster page and install the component from modules.
Usage
This feature can be enabled from the advanced configuration in a Service's deployment configuration page.