Load balancing is configured declaratively via a YAML file with the following key fields:

  • type: The type should be gateway-load-balancing-config. It helps the TrueFoundry platform identify that this is a load balancing configuration file.
  • name: The name can be anything like <companyname>-gateway-load-balancing-config. The name is only used for logging purposes and doesn’t have any other significance.
  • model_configs (Optional): Defines global constraints like rate limits or failure tolerance for each model. This is only needed if you want to load balance based on rate limits or avoid routing to unhealthy models.
  • rules: An array of rules, the data structure for which is described in detail below. Each rule defines the subset of requests to which it applies and the strategy used to route them. Rules are evaluated in the order they are defined; the first rule that matches the request is applied and all subsequent rules are ignored. Hence, it’s recommended to define exactly which subset of requests each rule applies to.

YAML Structure

# Configuration Details

name: string   # Required: Configuration name (e.g. "loadbalancing-config")                          
type: gateway-load-balancing-config 

# Model Configurations
model_configs:
    # Required: Identifier of the model in the TrueFoundry AI Gateway, e.g. azure/gpt4
  - model: string
    # Optional: Rate limiting configuration
    usage_limits:
      # Optional: Maximum number of tokens the model is allowed to process per minute, e.g. 100000
      tokens_per_minute: integer
      # Optional: Maximum number of requests the model can serve per minute, e.g. 1000
      requests_per_minute: integer
    # Optional: Failure handling configuration
    failure_tolerance:
      # Required: Max failures allowed per minute before marking the model as unhealthy. Once a model is marked as unhealthy, it will not receive any traffic until the cooldown period is over.
      allowed_failures_per_minute: integer
      # Required: Cooldown duration in minutes before the model is considered healthy again after being marked as unhealthy.
      cooldown_period_minutes: integer
      # Required: List of HTTP status codes that indicate a failed request.
      failure_status_codes: [integer]
# Rules
rules:
    # Required: Unique identifier for the rule                  
  - id: string 
    # Required: Must be either "weight-based-routing" or "latency-based-routing"         
    type: string
    # Required: Conditions for when to apply this rule        
    when:         
      # Optional: List of user/virtual account identifiers
      subjects: [string]
      # Required: List of model names to match
      models: [string]
      # Optional: Additional matching criteria (e.g., { environment: "production" })   
      metadata: object
    load_balance_targets: # Required: List of models to route to
        # Required: Model identifier. The model identifier is the name of the model in the TrueFoundry AI Gateway.
      - target: string 
        # Required for weight-based routing: integer between 0 and 100; the weights within a rule must sum to 100. This shouldn't be provided if using latency-based routing.
        weight: integer
        # Optional: Model-specific parameters.
        override_params: object

model_configs

The model_configs section defines global constraints for individual models. These constraints are respected by all the rules.

For example, if we define that the azure/gpt4 model has a rate limit of 50000 tokens per minute, and any of the load-balancing rules targets requests going to azure/gpt4, the gateway will stop routing requests to azure/gpt4 once it hits the limit and choose the next best model.

Similarly, if we define that the azure/gpt4 model can only tolerate 3 failures per minute, and it receives more than 3 failures, it will be marked as unhealthy and requests will not be routed to it. After a cooldown period of 5 minutes, requests will be routed to azure/gpt4 again; if it again exceeds the failure threshold, the model will be marked as unhealthy once more and taken out of rotation.

Example:

model_configs:
  - model: "azure/gpt4"
    usage_limits:
      tokens_per_minute: 50000
      requests_per_minute: 100
    failure_tolerance:
      allowed_failures_per_minute: 3
      cooldown_period_minutes: 5
      failure_status_codes: [429, 500, 502, 503, 504]
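The failure-tolerance behaviour this config describes can be sketched roughly as follows. This is a minimal illustration of the sliding-window-plus-cooldown idea; the class and method names are hypothetical, not the gateway's internals:

```python
import time

# Illustrative sketch only: tracks failures in the last minute and applies
# a cooldown once the allowed threshold is exceeded.
class FailureTracker:
    def __init__(self, allowed_failures_per_minute, cooldown_period_minutes):
        self.allowed = allowed_failures_per_minute
        self.cooldown = cooldown_period_minutes * 60
        self.failures = []          # timestamps of recent failures
        self.unhealthy_until = 0.0  # epoch time when the cooldown ends

    def record_failure(self, status_code, failure_status_codes, now=None):
        now = time.time() if now is None else now
        if status_code not in failure_status_codes:
            return  # only configured status codes count as failures
        # Keep only failures from the last 60 seconds, then add this one.
        self.failures = [t for t in self.failures if now - t < 60]
        self.failures.append(now)
        if len(self.failures) > self.allowed:
            # More failures than allowed in the last minute: mark unhealthy.
            self.unhealthy_until = now + self.cooldown
            self.failures.clear()

    def is_healthy(self, now=None):
        now = time.time() if now is None else now
        return now >= self.unhealthy_until
```

With `allowed_failures_per_minute: 3` and `cooldown_period_minutes: 5`, a fourth 429 within a minute would mark the model unhealthy for 300 seconds.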

rules

The rules section is the most important part of the load balancing configuration. It comprises the following key parts:

  1. id: A unique identifier for the rule. All rules in the array must have a unique id. This is used to identify the rule in logs and metrics.

  2. when (Define the subset of requests on which the rule applies): The TrueFoundry AI Gateway provides a very flexible configuration to define the exact subset of requests to which a rule applies. Matching can be based on the user calling the model, the model name, or any custom metadata key present in the request header X-TFY-METADATA. The subjects, models and metadata fields are combined with AND semantics - the rule only matches if all the specified conditions are met. If an incoming request doesn’t match the when block of one rule, the next rule is evaluated.

    • subjects: Filter based on the list of users / teams / virtual accounts calling the model. Subjects can be specified as user:john-doe, team:engineering-team or virtual-account:acct_1234567890.
    • models: Rule matches if the model name in the request matches any of the models in the list.
    • metadata: Rule matches if the metadata in the request matches the metadata in the rule. For example, if we specify metadata: {environment: "production"}, the rule will only match if the request has the metadata key environment with value production in the request header X-TFY-METADATA.
  3. type (Routing strategy): The TrueFoundry AI Gateway supports two routing strategies - weight-based and latency-based - which are described below. The value of the type field should be either weight-based-routing or latency-based-routing depending on which strategy to use. To understand how these strategies work, check out how the gateway does load balancing.

  4. load_balance_targets (Models to route traffic to): This defines the list of models that are eligible to serve requests matching this rule.
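The first-match evaluation and AND semantics described above can be sketched as follows. This is a minimal illustration under assumed field names (`subject`, `model`, `metadata` on the request), not the gateway's actual implementation:

```python
# Illustrative sketch of rule matching; names are hypothetical.

def rule_matches(rule, request):
    when = rule["when"]
    # Every specified condition must hold (AND semantics); omitted optional
    # conditions (subjects, metadata) act as wildcards.
    if "subjects" in when and request["subject"] not in when["subjects"]:
        return False
    if request["model"] not in when["models"]:
        return False
    for key, value in when.get("metadata", {}).items():
        if request.get("metadata", {}).get(key) != value:
            return False
    return True

def find_rule(rules, request):
    # Rules are evaluated in order; the first match wins, the rest are ignored.
    for rule in rules:
        if rule_matches(rule, request):
            return rule
    return None
```

Because the first match wins, a narrow rule (e.g. one that also requires `metadata: {environment: "production"}`) should be listed before a broader rule for the same model.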

Weight-Based Strategy Configuration

  • Each target must have a weight value
  • Weights must be integers between 0 and 100
  • Sum of all weights in a rule must equal 100
  • Optional override_params can be specified per target
rules:
  - id: "production-rollout"
    type: "weight-based-routing"
    when:
      models: 
        - "gpt-4"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 80
        override_params:
          temperature: 0.7
      - target: "openai/gpt4"
        weight: 20
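Weight-based selection can be pictured as a weighted random draw over the targets, with each weight acting as a traffic percentage. This is an illustrative sketch, not the gateway's actual sampling code:

```python
import random

def pick_target(targets, rng=random):
    # Weights are integers between 0 and 100 that must sum to 100,
    # so each weight is the share of traffic that target receives.
    weights = [t["weight"] for t in targets]
    assert sum(weights) == 100, "weights in a rule must sum to 100"
    return rng.choices(targets, weights=weights, k=1)[0]
```

With the example rule above, roughly 80% of picks land on azure/gpt4 and 20% on openai/gpt4.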

Latency-Based Strategy Configuration

  • No weight values are needed
  • Optional override_params can be specified per target
  • System automatically monitors and routes to fastest models
  • Models are considered “fast” if their response time is within 1.2x of the fastest model
  • System needs at least 3 requests to start evaluating a model’s performance
  • Uses either the last 20 minutes of traffic or the most recent 100 requests, whichever window contains fewer requests
rules:
  - id: "performance-optimized"
    type: "latency-based-routing"
    when:
      models: 
        - "claude-3"
    load_balance_targets:
      - target: "anthropic/claude-3-opus"
      - target: "anthropic/claude-3-sonnet"
        override_params:
          max_tokens: 1000
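The eligibility rules for latency-based routing above can be sketched like this. It is an illustration only (averaging over a pre-collected sample window); the gateway's actual latency bookkeeping may differ:

```python
def fast_targets(latency_samples, threshold=1.2, min_requests=3):
    # latency_samples maps each target to its recent response times,
    # conceptually from the last 20 minutes or the most recent 100
    # requests, whichever window contains fewer requests.
    averages = {
        target: sum(times) / len(times)
        for target, times in latency_samples.items()
        if len(times) >= min_requests  # need at least 3 requests to evaluate
    }
    if not averages:
        return []
    fastest = min(averages.values())
    # A target is "fast" if its latency is within 1.2x of the fastest model.
    return [t for t, avg in averages.items() if avg <= threshold * fastest]
```

A target with too few samples is simply not evaluated yet, and targets outside the 1.2x band are skipped in favour of the faster ones.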

Commonly Used Load Balancing Configurations

Here are a few examples of load balancing configurations for different use cases.

Gradually roll out a new gpt-4 model version

YAML
    name: loadbalancing-config
    type: gateway-load-balancing-config
    rules:
      - id: "gpt4-canary"
        type: "weight-based-routing"
        when:
          models:
            - "gpt-4"
        load_balance_targets:
          - target: "azure/gpt4-v1"  # Current production model
            weight: 90
          - target: "azure/gpt4-v2"  # New model version
            weight: 10