Load balancing YAML file configuration
gateway-load-balancing-config: This value helps the TrueFoundry platform identify that this is a load balancing configuration file.
<companyname>-gateway-load-balancing-config: The name is only used for logging purposes and doesn’t have any other significance.
model_configs: This section defines global constraints for individual models. These constraints are respected by all the rules.
For example, if we define that the azure/gpt4 model has a rate limit of 50000 tokens per minute and a load-balancing rule targets azure/gpt4, the gateway stops routing requests to azure/gpt4 once it hits the limit and chooses the next best model.
Similarly, if we define that the azure/gpt4 model can tolerate only 3 failures per minute and it receives more than 3 failures, it is marked as unhealthy and requests are not routed to it. After a cooldown period of 5 minutes, the gateway starts routing requests to azure/gpt4 again. If the failure limit is exceeded again, the model is marked as unhealthy again and requests are not routed to it.
Example:
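The behavior described above (a per-minute token limit, a per-minute failure tolerance, and a cooldown) can be illustrated with a small Python sketch. This is not the gateway's implementation or its configuration schema: the class and field names (ModelConstraints, ModelState, tokens_per_minute, allowed_failures_per_minute, cooldown_seconds) are hypothetical, and a simple 60-second sliding window is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class ModelConstraints:
    """Hypothetical per-model limits; names are illustrative only."""
    tokens_per_minute: int = 50_000
    allowed_failures_per_minute: int = 3
    cooldown_seconds: float = 300.0  # 5 minutes


@dataclass
class ModelState:
    constraints: ModelConstraints
    token_events: list = field(default_factory=list)   # (timestamp, tokens)
    failure_times: list = field(default_factory=list)  # failure timestamps
    unhealthy_until: float = 0.0

    def record_tokens(self, now: float, tokens: int) -> None:
        self.token_events.append((now, tokens))

    def record_failure(self, now: float) -> None:
        # Keep only failures from the last 60 seconds (assumed sliding window).
        self.failure_times = [ts for ts in self.failure_times if now - ts < 60]
        self.failure_times.append(now)
        if len(self.failure_times) > self.constraints.allowed_failures_per_minute:
            # More failures than tolerated: start the cooldown period.
            self.unhealthy_until = now + self.constraints.cooldown_seconds
            self.failure_times.clear()

    def is_routable(self, now: float) -> bool:
        if now < self.unhealthy_until:
            return False  # still cooling down
        used = sum(t for ts, t in self.token_events if now - ts < 60)
        return used < self.constraints.tokens_per_minute


state = ModelState(ModelConstraints())
for second in range(4):            # 4 failures within one minute
    state.record_failure(now=float(second))
print(state.is_routable(now=10.0))   # False: marked unhealthy
print(state.is_routable(now=400.0))  # True: cooldown has elapsed
```

A router built on this idea would skip any model for which is_routable returns False and fall through to the next candidate.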
X-TFY-METADATA. The subjects, models, and metadata fields are combined with AND: the rule matches only if all the conditions are met. If an incoming request doesn’t match the when block of one rule, the next rule is evaluated.
subjects: Filter based on the list of users, teams, or virtual accounts calling the model. A subject can be specified as user:john-doe, team:engineering-team, or virtual-account:acct_1234567890.
models: Rule matches if the model name in the request matches any of the models in the list.
metadata: Rule matches if the metadata in the request matches the metadata in the rule. For example, if we specify metadata: {environment: "production"}, the rule will only match if the request has the metadata key environment with value production in the request header X-TFY-METADATA.
Set either weight-based-routing or latency-based-routing, depending on which strategy to use. To understand how these strategies work, check out how the gateway does load balancing.
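The matching semantics of the when block can be sketched in Python as follows. This is an illustration only: the When and Rule classes and the function names are hypothetical, the caller is assumed to be represented as a set of subject strings, an omitted condition is assumed to mean "no constraint", the first matching rule is assumed to win, and the metadata is assumed to be already parsed from the X-TFY-METADATA header into a dict.

```python
from dataclasses import dataclass, field


@dataclass
class When:
    subjects: list = field(default_factory=list)
    models: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    rule_id: str
    when: When


def matches(when: When, caller_subjects: set, model: str, metadata: dict) -> bool:
    # Assumption: an empty condition places no constraint on the request.
    subject_ok = not when.subjects or any(s in caller_subjects for s in when.subjects)
    model_ok = not when.models or model in when.models
    metadata_ok = all(metadata.get(k) == v for k, v in when.metadata.items())
    # The three conditions are combined with AND.
    return subject_ok and model_ok and metadata_ok


def first_matching_rule(rules, caller_subjects, model, metadata):
    # Rules are evaluated in order; a non-matching rule falls through to the next.
    for rule in rules:
        if matches(rule.when, caller_subjects, model, metadata):
            return rule
    return None


rules = [
    Rule("prod-gpt4", When(subjects=["team:engineering-team"],
                           models=["azure/gpt4"],
                           metadata={"environment": "production"})),
    Rule("any-gpt4", When(models=["azure/gpt4"])),
]
caller = {"user:john-doe", "team:engineering-team"}
print(first_matching_rule(rules, caller, "azure/gpt4",
                          {"environment": "production"}).rule_id)  # prod-gpt4
print(first_matching_rule(rules, caller, "azure/gpt4",
                          {"environment": "staging"}).rule_id)     # any-gpt4
```

In the second call the metadata condition of the first rule fails, so the whole rule fails and evaluation moves on to the next rule.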
Models to route traffic to: This defines the list of models that are eligible for routing requests under this rule.
weight value
override_params can be specified per target
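One common way to realize weight-based routing is weighted random selection over the eligible targets. The sketch below assumes that weights act as relative probabilities and that unhealthy or rate-limited targets are excluded before selection. The gateway's actual algorithm is not described here, and pick_target, the is_routable callback, and the second target name are hypothetical.

```python
import random


def pick_target(targets, is_routable=lambda name: True, rng=random):
    """Pick one target name from (name, weight) pairs, proportionally to weight.

    Targets with non-positive weight, or rejected by is_routable, are skipped.
    Returns None when no target is eligible.
    """
    eligible = [(name, w) for name, w in targets if w > 0 and is_routable(name)]
    if not eligible:
        return None
    names, weights = zip(*eligible)
    return rng.choices(names, weights=weights, k=1)[0]


targets = [("azure/gpt4", 80), ("openai/gpt4", 20)]  # second name is illustrative
print(pick_target(targets))  # "azure/gpt4" roughly 80% of the time

# If azure/gpt4 is unhealthy, all traffic goes to the remaining target.
print(pick_target(targets, is_routable=lambda n: n != "azure/gpt4"))  # openai/gpt4
```

Passing a seeded random.Random instance as rng makes the selection reproducible, which is convenient when checking the traffic split.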