Configuration
Load balancing YAML file configuration
Load balancing is configured declaratively via a YAML file. The YAML file has key fields:
- type: The type should be `gateway-load-balancing-config`. It helps the Truefoundry platform identify that this is a load balancing configuration file.
- name: The name can be anything like `<companyname>-gateway-load-balancing-config`. The name is only used for logging purposes and doesn't have any other significance.
- model_configs (Optional): Defines global constraints like rate limits or failure tolerance for each model. This is only needed if you want to load balance based on rate limits or avoid routing to unhealthy models.
- rules: An array of rules, the data structure for which is described in detail below. Each rule defines the subset of requests to which it applies and the strategy used to route those requests. Rules are evaluated in the order they are defined: the first rule that matches the request is applied and all subsequent rules are ignored. Hence, it is recommended to define exactly which subset of requests each rule applies to.
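A rough skeleton of the overall file shape is sketched below; the top-level keys are the ones listed above, and the placeholder values and name are illustrative only. The per-model and per-rule fields are described in the sections that follow.

```yaml
# Rough skeleton of a load balancing config file (placeholder values).
name: my-company-gateway-load-balancing-config
type: gateway-load-balancing-config
model_configs: []   # optional: global per-model constraints (rate limits, failure tolerance)
rules: []           # evaluated top to bottom; the first matching rule wins
```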
YAML Structure
model_configs
The `model_configs` section defines global constraints for individual models. These constraints are respected by all the rules.
For example, if we define that the azure/gpt4 model has a rate limit of 50000 tokens per minute, and any of the load-balancing rules targets requests going to azure/gpt4, the gateway will stop routing requests to azure/gpt4 once it hits the limit and choose the next best model.
Similarly, if we define that the azure/gpt4 model can only tolerate 3 failures per minute, and it receives more than 3 failures, it will be marked as unhealthy and requests will not be routed to it. After a cooldown period of 5 minutes, the gateway will start routing requests to azure/gpt4 again - if 3 requests fail again, the model will be marked as unhealthy once more and requests will not be routed to it.
Example:
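A sketch of what such a `model_configs` block could look like is shown below. The nested key names (`rate_limits`, `failure_tolerance`, and their children) are illustrative assumptions rather than the exact gateway schema; only the limits themselves (50000 tokens per minute, 3 failures per minute, 5 minute cooldown) come from the description above.

```yaml
# Illustrative sketch only - the nested keys are assumptions, not the exact schema.
model_configs:
  - model: azure/gpt4
    rate_limits:
      tokens_per_minute: 50000        # stop routing here once the limit is hit
    failure_tolerance:
      max_failures_per_minute: 3      # mark unhealthy after 3 failures in a minute
      cooldown_minutes: 5             # resume routing after the cooldown expires
```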
rules
The `rules` section is the most important part of the load balancing configuration. It comprises the following key parts:
- id: A unique identifier for the rule. All rules in the array must have a unique id. This is used to identify the rule in logs and metrics.
- when (Define the subset of requests to which the rule applies): Truefoundry AI gateway provides a very flexible configuration to define the exact subset of requests to which the rule applies. We can filter based on the user calling the model, the model name, or any custom metadata key present in the request header `X-TFY-METADATA`. The subjects, models and metadata fields are combined in an AND fashion - meaning that the rule will only match if all the conditions are met. If an incoming request doesn't match the when block in one rule, the next rule will be evaluated.
  - subjects: Filter based on the list of users / teams / virtual accounts calling the model. A user can be specified using `user:john-doe`, `team:engineering-team` or `virtual-account:acct_1234567890`.
  - models: Rule matches if the model name in the request matches any of the models in the list.
  - metadata: Rule matches if the metadata in the request matches the metadata in the rule. For example, if we specify `metadata: {environment: "production"}`, the rule will only match if the request has the metadata key `environment` with value `production` in the request header `X-TFY-METADATA`.
- type (Routing strategy): Truefoundry AI gateway supports two routing strategies - weight based and latency based - which are described below. The value of the type field should be either `weight-based-routing` or `latency-based-routing` depending on which strategy to use. To understand how these strategies work, check out how the gateway does load balancing.
- load_balance_targets (Models to route traffic to): This defines the list of models which will be eligible for routing requests for this rule (a complete rule example is sketched after this list).
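Putting these parts together, a single weight-based rule could look like the sketch below. The field names `id`, `when`, `subjects`, `models`, `metadata`, `type`, `load_balance_targets` and `weight` are the ones described above; the `target` key inside each entry and the concrete model names are illustrative assumptions.

```yaml
# Sketch of one rule combining the parts described above.
# The "target" key and the model names are illustrative assumptions.
rules:
  - id: route-production-gpt4
    when:
      subjects: ["team:engineering-team"]
      models: ["openai/gpt4"]
      metadata:
        environment: production
    type: weight-based-routing
    load_balance_targets:
      - target: azure/gpt4
        weight: 80
      - target: openai/gpt4
        weight: 20
```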
Weight-Based Strategy Configuration
- Each target must have a `weight` value
- Weights must be integers between 0 and 100
- Sum of all weights in a rule must equal 100
- Optional `override_params` can be specified per target
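For example, a 70/30 split across two targets could be written like the sketch below; the `target` key and the contents of `override_params` are illustrative assumptions.

```yaml
# Weights are integers that must sum to 100 across the rule's targets.
load_balance_targets:
  - target: azure/gpt4      # receives ~70% of matching traffic
    weight: 70
  - target: openai/gpt4     # receives ~30% of matching traffic
    weight: 30
    override_params:
      temperature: 0.2      # optional per-target override (illustrative value)
```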
Latency-Based Strategy Configuration
- No weight values are needed
- Optional `override_params` can be specified per target
- System automatically monitors and routes to the fastest models
- Models are considered “fast” if their response time is within 1.2x of the fastest model
- System needs at least 3 requests to start evaluating a model’s performance
- Uses either last 20 minutes or most recent 100 requests, whichever has fewer requests
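A latency-based rule differs only in the `type` value and the absence of weights. A sketch, with the same illustrative `target` key and model names as above, could look like:

```yaml
# No weights needed - the gateway routes among the fastest healthy targets.
type: latency-based-routing
load_balance_targets:
  - target: azure/gpt4
  - target: openai/gpt4
  - target: bedrock/claude-3-sonnet
```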
Commonly Used Load Balancing Configurations
Here are a few examples of load balancing configurations for different use cases.
Gradually roll out a new gpt-4 model version
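A weight-based rule can send a small share of traffic to the new version and have its weight increased over time. The model names and the `target` key below are illustrative assumptions.

```yaml
# Start with a 90/10 split; raise the new version's weight as confidence grows.
rules:
  - id: gpt4-new-version-rollout
    when:
      models: ["openai/gpt4"]
    type: weight-based-routing
    load_balance_targets:
      - target: azure/gpt4          # current version
        weight: 90
      - target: azure/gpt4-new      # new version being rolled out
        weight: 10
```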
Route premium users to high-performance models using weight-based routing. If one model becomes unhealthy, it is automatically removed from routing until it recovers, ensuring consistent service availability.
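One way to express this is a metadata match on a premium tier combined with a failure-tolerance constraint in `model_configs`. The `tier: premium` metadata key, the nested `model_configs` keys, the `target` key and the model names are all illustrative assumptions.

```yaml
# Unhealthy models are skipped until they recover, then routing resumes.
model_configs:
  - model: azure/gpt4
    failure_tolerance:
      max_failures_per_minute: 3    # illustrative key name
rules:
  - id: premium-users-weighted
    when:
      metadata:
        tier: premium               # sent via the X-TFY-METADATA header
    type: weight-based-routing
    load_balance_targets:
      - target: azure/gpt4
        weight: 60
      - target: openai/gpt4
        weight: 40
```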
Route traffic based on latency and token limits. If azure/gpt4 token limit is hit, it will route to openai/gpt4.
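A sketch of such a setup is below: a latency-based rule over both targets, with a token limit on azure/gpt4 in `model_configs` so that openai/gpt4 takes over once the limit is hit. The nested `model_configs` keys, the `target` key and the matched model name are illustrative assumptions.

```yaml
model_configs:
  - model: azure/gpt4
    rate_limits:
      tokens_per_minute: 50000      # once hit, azure/gpt4 is skipped
rules:
  - id: latency-with-token-limits
    when:
      models: ["gpt4"]
    type: latency-based-routing
    load_balance_targets:
      - target: azure/gpt4
      - target: openai/gpt4          # takes traffic once azure/gpt4 hits its limit
```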
Route traffic based on dev or prod environment
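This can be done with two rules keyed on the `environment` metadata value from the `X-TFY-METADATA` header. The `target` key and model names below are illustrative assumptions.

```yaml
rules:
  - id: prod-traffic
    when:
      metadata:
        environment: production
    type: weight-based-routing
    load_balance_targets:
      - target: azure/gpt4
        weight: 100
  - id: dev-traffic
    when:
      metadata:
        environment: dev
    type: weight-based-routing
    load_balance_targets:
      - target: openai/gpt4-mini
        weight: 100
```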