Autoscaling
While deploying a service, you can provision multiple instances of your container (a pod in the Kubernetes world) so that incoming traffic is load-balanced across them. We refer to each such instance as a replica in TrueFoundry. When deploying on TrueFoundry, the default number of replicas is 1. While this is fine for development purposes, a single replica is not the best choice for production workloads for the following reasons:
- If one replica goes down for any reason, the entire service goes down. Having at least 2-3 replicas provides more fault tolerance.
- A single replica might not be able to handle all of the incoming traffic, causing client-side errors.
To increase the number of replicas, we can either set it to a fixed higher value or enable autoscaling so that the number of replicas is adjusted automatically based on incoming traffic, CPU usage, or another configured metric.
Setting Fixed Number of Replicas
This can be a good choice if the incoming traffic remains constant and we have a good idea of the number of replicas needed. To set the replicas using the UI, you can follow the steps below:
Enable Autoscaling
When the traffic or resource usage of the service is not constant, we ideally want the number of replicas to go up and down with the incoming traffic. In this case, we define the minimum and maximum number of replicas, and the autoscaling strategy decides how many replicas to run within those bounds. We can enable autoscaling based on the following parameters:
- CPU Usage: If the application’s CPU usage goes up when traffic increases, this can be a good parameter to autoscale on. For example, say your application (one replica) consumes 0.3 vCPU at steady state, but as traffic goes up, CPU usage increases to a maximum of 1 vCPU. In this case, setting autoscaling to trigger when CPU usage exceeds 0.6 can be a good idea.
- Requests Per Second (RPS): This is the easiest to reason about and calculate. We can benchmark our service with a tool like Locust to decide how many requests per second one replica can serve. Say one replica can serve 10 requests per second without any degradation in quality (increase in latency or errors). If we expect the traffic to vary from 50 to 200 requests per second, we can set the minimum replicas to 5, the maximum to 20, and the RPS autoscaling metric to 10 (a quick sizing calculation is sketched after this list).
- Time Based Autoscaling: This can be useful if the traffic patterns shift based on timezones, or you want to shut down dev workloads during non-office hours. For example, if you want to scale up replicas only between 9 AM and 9 PM from Monday to Friday, you can set time-based scheduling with the cron start schedule `0 9 * * 1-5` and the end schedule `0 21 * * 1-5`.
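To make the RPS arithmetic above concrete, here is a minimal Python sketch of how the replica bounds follow from a per-replica benchmark. The numbers are taken from the example above; swap in your own benchmark results:

```python
import math

# Benchmarked capacity: requests per second one replica can serve
# without latency or error degradation (e.g. measured with Locust).
per_replica_rps = 10

# Expected traffic range for the service.
min_expected_rps = 50
max_expected_rps = 200

# Replica bounds so that each replica stays at or below its benchmarked RPS.
min_replicas = math.ceil(min_expected_rps / per_replica_rps)  # 5
max_replicas = math.ceil(max_expected_rps / per_replica_rps)  # 20

print(f"min_replicas={min_replicas}, max_replicas={max_replicas}, rps_target={per_replica_rps}")
```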
For most ML workloads, scaling based on requests per second is the most effective scaling metric, because most ML libraries are optimized to max out CPU and GPU usage.
Configure Autoscaling for an Application
The key things to configure for autoscaling are:
- Min Replicas: The minimum number of replicas to scale down to.
- Max Replicas: The maximum number of replicas to scale up to.
- Autoscaling Metric: The metric to autoscale on.
- Autoscaling Value: The threshold value of the metric at which scaling is triggered.
- Polling Interval (Advanced): This defines the time interval at which the metrics are checked to decide whether the service needs to be scaled up or down. The default value is 30 seconds. You can lower it to 10 seconds if you want to scale more quickly.
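To illustrate how these four settings interact, here is a small Python sketch of the scaling decision. The proportional formula is the same idea the Kubernetes Horizontal Pod Autoscaler uses; read_metric and apply_replicas are hypothetical placeholders, not TrueFoundry APIs:

```python
import math
import time

def desired_replicas(current_replicas, observed_value, target_value,
                     min_replicas, max_replicas):
    # Proportional scaling rule: scale the replica count by the ratio of the
    # observed per-replica metric to the target value, then clamp to bounds.
    desired = math.ceil(current_replicas * observed_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

def autoscale_loop(read_metric, apply_replicas, polling_interval=30,
                   target_value=10, min_replicas=5, max_replicas=20):
    # Illustrative control loop: every polling interval, re-evaluate the metric.
    # read_metric() and apply_replicas() stand in for whatever the platform
    # uses to observe the metric and change the replica count.
    replicas = min_replicas
    while True:
        observed = read_metric()  # e.g. average RPS per replica, or CPU usage
        replicas = desired_replicas(replicas, observed, target_value,
                                    min_replicas, max_replicas)
        apply_replicas(replicas)
        time.sleep(polling_interval)
```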
How to decide the autoscaling metric value
When deploying a service into production, it becomes essential to thoroughly understand its performance metrics. Key considerations include:
- Determining the service’s capacity in terms of handling requests per second.
- Assessing the threshold of concurrent requests, indicating the number of users the service can accommodate simultaneously.
- Analyzing how latency fluctuates with increasing traffic volume.
These inquiries are important as they inform crucial decisions regarding service configurations and operational strategies:
- Determining the requisite number of replicas.
- Evaluating whether auto-scaling is necessary.
- If auto-scaling is necessary:
  - Selecting appropriate metrics for triggering auto-scaling (e.g., requests per second, CPU utilization, or custom metrics).
  - Establishing thresholds for these auto-scaling strategies.
This evaluation ensures adherence to service level agreements (SLAs) while simultaneously optimizing costs.
In this guide, we will use Locust, an open-source tool for benchmarking services.
Setup
You can set up Locust in any environment with Python installed using the following command:
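```bash
pip install locust
```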
Writing the Locust File
To benchmark your service with Locust, you need to write a locust file that defines the API endpoint (path) you want to benchmark and a sample request for it.
Here is a small example that benchmarks a deployed LLM. You can adapt this script to benchmark any service, not just LLMs.
Note: You will need to replace the model name with the name of your deployed service.
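A minimal sketch of such a locust file is shown below. It assumes the deployed LLM exposes an OpenAI-compatible /v1/chat/completions route; the path, payload, and the "my-deployed-model" name are placeholders to adjust for your own service:

```python
# locust_benchmark.py
from locust import HttpUser, task, between


class LLMUser(HttpUser):
    # Wait 1-2 seconds between consecutive requests from each simulated user.
    wait_time = between(1, 2)

    @task
    def chat_completion(self):
        # Assumes an OpenAI-compatible endpoint; adjust the path and payload
        # to match your deployed service. Replace "my-deployed-model" with
        # the name of your deployed service/model.
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "my-deployed-model",
                "messages": [
                    {"role": "user", "content": "Write a haiku about autoscaling."}
                ],
                "max_tokens": 64,
            },
            headers={"Content-Type": "application/json"},
        )
```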
Running the benchmarks
You can start Locust with the following command:
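```bash
locust -f locust_benchmark.py
```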
Once you run this, you will find a service running on port 8089.
Now, open http://localhost:8089 in a browser window. You will see a UI like this:
In the Host field, paste the endpoint of your Service by copying the deployed endpoint as shown below:
Once you paste the link, you can click on “Start Swarming” after setting the following parameters:
- Number of Users: Number of concurrent users that will bombard your service
- Spawn Rate: The rate at which new users are started until the target number of users is reached (this can be left at 1 by default)
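If you prefer to skip the web UI, the same parameters can also be passed on the command line in headless mode (the host URL below is a placeholder for your deployed endpoint):

```bash
locust -f locust_benchmark.py --headless \
  --users 50 --spawn-rate 1 \
  --host https://your-service.example.com
```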
Once you start swarming, you can see the results on the dashboard:
Once this is set up, you can increase the number of users from the top bar by clicking on Edit.
You can view detailed charts by clicking on the “Charts” tab as shown in the picture. The results look something like this:
Deploying this Locust Script as a Service
While you can run this script locally, your internet speed and differences in the local setups of different users might affect the results.
To avoid this, you can deploy the script as a Service on TrueFoundry.
You will need two files:
- locust_benchmark.py (shown above)
- requirements.txt
The contents of requirements.txt are:
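At a minimum this is just the Locust package itself; add any other libraries your locust file imports:

```
locust
```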
Once you have this ready, go to the TrueFoundry UI and follow these steps:
- Click on the “New Deployment” button on the top right of your screen.
- Select your workspace.
- Select “Code from Laptop”.
- Click on “Next”.
- Follow the guide from the UI.
Steps 2 to 4 are illustrated in the image below:
Once you complete all the steps and deploy your Service, you can access the deployed locust benchmarking script from the UI by clicking on the “Endpoint” as shown below: