Routing is configured declaratively via a YAML file. The UI provides an easy-to-use way to edit this configuration. The configuration has the following key fields:
  • type: Must be gateway-load-balancing-config. It tells the TrueFoundry platform that this is a load balancing configuration file.
  • name: The name can be anything like <companyname>-gateway-load-balancing-config. The name is only used for logging purposes and doesn’t have any other significance.
  • rules: An array of rules; the data structure of a rule is described in detail below. Each rule defines the subset of requests it applies to and the strategy used to route those requests. Rules are evaluated in the order they are defined. The first rule that matches a request is applied, and all subsequent rules are ignored. It is therefore recommended to define precisely which subset of requests each rule applies to.
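
As a minimal sketch of why rule order matters, the configuration below defines two rules that both match the same model. The rule ids, the model identifiers (openai-main/gpt-4, azure-openai/gpt-4) and the configuration name are hypothetical placeholders, not names taken from a real deployment.

```yaml
name: example-gateway-load-balancing-config   # hypothetical name
type: gateway-load-balancing-config
rules:
  # Evaluated first: the narrower rule, matching production traffic only.
  - id: gpt4-production
    type: weight-based-routing
    when:
      models: ["openai-main/gpt-4"]        # hypothetical model name
      metadata:
        environment: production
    load_balance_targets:
      - target: azure-openai/gpt-4         # hypothetical target
        weight: 100
  # Evaluated second: matches the remaining requests for the same model.
  - id: gpt4-default
    type: weight-based-routing
    when:
      models: ["openai-main/gpt-4"]
    load_balance_targets:
      - target: openai-main/gpt-4
        weight: 100
```

If the two rules were swapped, gpt4-default would match every request for the model first and gpt4-production would never be applied.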

YAML Structure

# Configuration Details
name: string   # Required: Configuration name (e.g. "loadbalancing-config")                          
type: gateway-load-balancing-config 

# Rules
rules:
    # Required: Unique identifier for the rule                  
  - id: string 
    # Required: Must be "weight-based-routing", "latency-based-routing", or "priority-based-routing"         
    type: string
    # Required: Conditions for when to apply this rule        
    when:         
      # Optional: List of user/virtual account identifiers
      subjects: string[]
      # Required: List of model names to match 
      models: string[]
      # Optional: Additional matching criteria (e.g., { environment: "production" })   
      metadata: object
    load_balance_targets: # Required: List of models to route to
        # Required: Model identifier. The model identifier is the name of the model in the TrueFoundry AI Gateway.
      - target: string 
        # Required for weight-based routing: Integer weight value, 0-100. 
        weight: integer
        # Required for priority-based routing: Priority level (lower number = higher priority)
        priority: integer
        # Optional: Retry configuration for this target
        retry_config:
          attempts: integer & >0
          delay: integer & >0  # milliseconds
          on_status_codes: [...string]
        # Optional: Status codes that trigger fallback to other targets
        fallback_status_codes: [...string]
        # Optional: Whether this target can be used as fallback candidate
        fallback_candidate: bool
        # Optional: Model-specific parameters to override
        override_params: object
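
To make the schema above concrete, here is one possible filled-in configuration using priority-based routing. It is a sketch under stated assumptions: the rule id and model identifiers are hypothetical, and the retry values simply restate the documented defaults explicitly.

```yaml
name: example-gateway-load-balancing-config   # hypothetical name
type: gateway-load-balancing-config
rules:
  - id: gpt4-priority-fallback               # hypothetical rule id
    type: priority-based-routing
    when:
      models: ["openai-main/gpt-4"]          # hypothetical model name
    load_balance_targets:
      # Lower number = higher priority, so this target is tried first.
      - target: azure-openai/gpt-4           # hypothetical target
        priority: 0
        retry_config:
          attempts: 2                        # documented default
          delay: 100                         # milliseconds, documented default
          on_status_codes: ["429", "500", "502", "503"]
        # Narrower than the default list: fall back only on these codes.
        fallback_status_codes: ["429", "500", "502", "503"]
      # Lower-priority target, explicitly marked as usable for fallback.
      - target: openai-main/gpt-4
        priority: 1
        fallback_candidate: true
```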

rules

The rules section is the most important part of the load balancing configuration. Each rule comprises the following key parts:
  1. id: A unique identifier for the rule. All rules in the array must have a unique id. This is used to identify the rule in logs and metrics.
  2. when (Define the subset of requests on which the rule applies): The TrueFoundry AI Gateway lets you define precisely which requests a rule applies to. A rule can match on the user calling the model, the model name, or any custom metadata key present in the request header X-TFY-METADATA. The subjects, models and metadata fields are combined with AND, meaning that the rule matches only if all the specified conditions are met. If an incoming request doesn’t match the when block of one rule, the next rule is evaluated.
    • subjects: Filter based on the list of users / teams / virtual accounts calling the model. Subjects are specified as user:john-doe, team:engineering-team or virtualaccount:acct_1234567890.
    • models: Rule matches if the model name in the request matches any of the models in the list.
    • metadata: Rule matches if the metadata in the request matches the metadata in the rule. For example, if we specify metadata: {environment: "production"}, the rule will only match if the request header X-TFY-METADATA contains the key environment with the value production.
  3. type (Routing strategy): The TrueFoundry AI Gateway supports three routing strategies: weight-based, latency-based, and priority-based. The value of the type field should be weight-based-routing, latency-based-routing, or priority-based-routing, depending on which strategy to use. To understand how these strategies work, see how the gateway does load balancing.
  4. load_balance_targets (Models to route traffic to): The list of models that are eligible to receive requests matching this rule. For each target, we can configure the following options:
    • Retry Configuration:
      • attempts: Number of retry attempts (Default value is: 2)
      • delay: Delay between retries in milliseconds (Default value is: 100)
      • on_status_codes: List of HTTP status codes that should trigger a retry (Default value is: ["429", "500", "502", "503"])
    • Fallback Configuration:
      • fallback_status_codes: List of HTTP status codes that trigger fallback to other targets (Default value is: ["401", "403", "404", "429", "500", "502", "503"])
      • fallback_candidate: Boolean indicating whether this target can be used as a fallback option for other targets (Default value is: true)
    • Override Parameters: Lets you override specific request parameters for each target. This is useful for setting different temperature values for different models, adjusting max_tokens based on model capabilities, or configuring other model-specific parameters. Example:
    YAML
    override_params:
      temperature: 0.5
      max_tokens: 1000
    
    • Weight (Only for Weight-based Routing): The weight of the target model, used to distribute requests across the target models. The weight is an integer between 0 and 100. It is a required field for weight-based routing. The sum of weights for all targets in a rule should be 100.
    • Priority (Only for Priority-based Routing): The priority of the target model, used to determine the order in which target models are tried in case of fallback. The priority is a number between 0 and 100, where a lower number means a higher priority. It is a required field for priority-based routing.
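
The parts described above can be combined as in the following rules fragment, a sketch rather than a recommended setup. The rule id and model identifiers are hypothetical. The when block uses all three conditions, so the rule matches only requests from team:engineering-team, for the listed model, that also carry environment: production in X-TFY-METADATA.

```yaml
rules:
  - id: engineering-prod-gpt4-split        # hypothetical rule id
    type: weight-based-routing
    when:
      # All three conditions must hold (AND) for the rule to match.
      subjects: ["team:engineering-team"]
      models: ["openai-main/gpt-4"]        # hypothetical model name
      metadata:
        environment: production
    load_balance_targets:
      # Weights across the targets of one rule sum to 100.
      - target: azure-openai/gpt-4         # hypothetical target
        weight: 80
        override_params:
          temperature: 0.5
      - target: openai-main/gpt-4
        weight: 20
        override_params:
          max_tokens: 1000
```

Because neither target sets retry_config, fallback_status_codes or fallback_candidate, the default values listed above apply to both.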

Commonly Used Routing Configurations

Here are a few examples of load balancing configurations for different use cases.