Weight-based routing distributes traffic across multiple model targets based on specified weight percentages. This strategy is ideal for gradual rollouts, A/B testing, or distributing load across different model providers.

Overview

Weight-based routing allows you to:
  • Distribute traffic proportionally across multiple models
  • Perform gradual rollouts with controlled traffic distribution
  • A/B test different model providers or configurations
  • Balance load across multiple endpoints

Configuration Structure

# Configuration Details
name: string   # Required: Configuration name (e.g., "weight-based-config")
type: gateway-load-balancing-config 

# Rules
rules:
  - id: string                    # Required: Unique identifier for the rule                  
    type: "weight-based-routing"  # Required: Must be "weight-based-routing"
    when:                         # Required: Conditions for when to apply this rule        
      subjects: string[]          # Optional: List of user/virtual account identifiers
      models: string[]            # Required: List of model names to match 
      metadata: object            # Optional: Additional matching criteria
    load_balance_targets:         # Required: List of models to route to
      - target: string            # Required: Model identifier
        weight: integer           # Required: Integer weight value, 0-100
        override_params: object   # Optional: Model-specific parameters to override

Key Requirements

Weight Configuration

  • Each target must have a weight value
  • Weights must be integers between 0 and 100
  • Sum of all weights in a rule must equal 100
  • Weights represent the percentage of traffic sent to each target
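The requirements above can be expressed as a small validation helper. This is an illustrative Python sketch, not the gateway's actual validator; the dict shape mirrors the `load_balance_targets` entries shown in the configuration structure:

```python
def validate_targets(targets):
    """Check weight rules: integer weights in [0, 100] summing to exactly 100.

    `targets` is a list of {"target": str, "weight": int} dicts, matching the
    load_balance_targets schema. Raises ValueError on any violation.
    """
    total = 0
    for t in targets:
        w = t.get("weight")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(w, int) or isinstance(w, bool) or not 0 <= w <= 100:
            raise ValueError(f"invalid weight for {t.get('target')}: {w!r}")
        total += w
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")
```

Running this at config-load time catches a mis-summed rule before any traffic is routed.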

Example Configurations

The following example demonstrates a weight-based routing configuration that directs 80% of requests to the Azure GPT-4 model and 20% to the OpenAI GPT-4 model. The when block specifies the conditions under which this rule applies: it matches requests for the gpt-4 model, coming from members of the engineering-team, and only in the production environment. All conditions in the when block must be satisfied for the rule to take effect.
rules:
  - id: "production-rollout"
    type: "weight-based-routing"
    when:
      models: 
        - "gpt-4"
      subjects:
        - "team:engineering-team"
      metadata:
        environment: "production"
    load_balance_targets:
      - target: "azure/gpt4"
        weight: 80
      - target: "openai/gpt4"
        weight: 20
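To see the 80/20 split in action, here is a small Python simulation of weighted target selection (illustrative only; the target names come from the example above, and the gateway's actual sampling mechanism may differ):

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the simulation is reproducible

targets = ["azure/gpt4", "openai/gpt4"]
weights = [80, 20]

# random.choices samples proportionally to `weights`; with 10,000 draws
# we expect roughly an 8000 / 2000 split between the two targets.
counts = Counter(rng.choices(targets, weights=weights, k=10_000))
```

Individual requests are routed probabilistically, so short windows can deviate from 80/20; the split converges to the configured weights as request volume grows.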

Best Practices

  1. Weight Validation: Always ensure weights sum to 100 for each rule
  2. Gradual Rollouts: Start with small weights for new models and gradually increase
  3. Monitoring: Monitor performance metrics for each target to adjust weights
  4. Fallback Strategy: Configure fallback candidates for robust error handling
  5. Testing: Test configurations in staging environments before production
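A gradual rollout (practice 2) amounts to a schedule of weight pairs that always sum to 100. The stage values and target names below are hypothetical, shown only to make the pattern concrete:

```python
# Hypothetical rollout schedule: shift traffic from the current target to a
# new one in stages, keeping the weights summing to 100 at every step.
ROLLOUT_STAGES = [(90, 10), (75, 25), (50, 50), (20, 80), (0, 100)]

def targets_for_stage(stage):
    """Return load_balance_targets for a given rollout stage."""
    old_w, new_w = ROLLOUT_STAGES[stage]
    return [
        {"target": "azure/gpt4", "weight": old_w},   # current model
        {"target": "openai/gpt4", "weight": new_w},  # model being rolled out
    ]
```

Advancing a stage is then just a config change to the two weight values, which pairs naturally with the monitoring practice above: promote a stage only after its metrics look healthy.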

Use Cases

  • Gradual Model Rollouts: Start with 10% traffic to new models
  • A/B Testing: Compare different model providers or configurations
  • Load Distribution: Balance traffic across multiple providers
  • Cost Optimization: Route to cheaper models for non-critical requests
  • Performance Optimization: Route to faster models for time-sensitive requests