Latency-based routing automatically monitors the performance of multiple models and routes requests to the fastest available models. This strategy is ideal for optimizing response times and ensuring the best user experience.

Overview

Latency-based routing provides:
  • Automatic performance optimization: Routes to fastest models without manual intervention
  • Dynamic adaptation: Adjusts routing based on real-time performance data
  • Load distribution: Spreads traffic across multiple fast models
  • Performance monitoring: Continuously tracks model response times

Configuration Structure

# Configuration Details
name: string                        # Required: Configuration name (e.g. "latency-based-config")
type: gateway-load-balancing-config

# Rules
rules:
  - id: string                      # Required: Unique identifier for the rule
    type: "latency-based-routing"   # Required: Must be "latency-based-routing"
    when:                           # Required: Conditions for when to apply this rule
      subjects: string[]            # Optional: List of user/virtual account identifiers
      models: string[]              # Required: List of model names to match
      metadata: object              # Optional: Additional matching criteria
    load_balance_targets:           # Required: List of models to route to
      - target: string              # Required: Model identifier
        override_params: object     # Optional: Model-specific parameters to override

Key Requirements

Latency Configuration

  • No weight values needed - system automatically determines routing
  • No priority values needed - performance determines routing order
  • System automatically monitors and routes to fastest models
  • Models are considered “fast” if their response time is within 1.2x of the fastest model
  • System needs at least 3 requests to start evaluating a model’s performance
  • Evaluates over the last 20 minutes or the most recent 100 requests, whichever window contains fewer requests (see the sketch below)
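
To make these window rules concrete, here is a minimal sketch; it is not the gateway's actual implementation, and the sample format and function names are assumptions. It shows how the 20-minute window, the 100-request window, and the 3-request warm-up check could combine.

import time

WINDOW_SECONDS = 20 * 60   # last 20 minutes
WINDOW_REQUESTS = 100      # most recent 100 requests
MIN_SAMPLES = 3            # requests needed before a model is evaluated

def average_latency(samples, now=None):
    """Average latency over whichever window holds fewer requests,
    or None while the model is still warming up (< 3 samples).
    `samples` is a time-ordered list of (timestamp, latency_ms) pairs."""
    now = time.time() if now is None else now
    recent = samples[-WINDOW_REQUESTS:]  # most recent 100 requests...
    in_window = [ms for ts, ms in recent
                 if now - ts <= WINDOW_SECONDS]  # ...that are under 20 minutes old
    if len(in_window) < MIN_SAMPLES:
        return None
    return sum(in_window) / len(in_window)

# e.g. three samples taken just now average to 500 ms:
now = time.time()
assert average_latency([(now, 450), (now, 500), (now, 550)], now=now) == 500.0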

Example Configurations
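
Basic Latency-Based Routing

The basic example below is a reconstruction that follows the configuration structure above; the rule id and the two target names are illustrative assumptions, not canonical identifiers.

rules:
  - id: "claude3-performance"
    type: "latency-based-routing"
    when:
      models:
        - "claude-3"
    load_balance_targets:
      - target: "anthropic/claude-3"
      - target: "aws-bedrock/claude-3"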

This configuration automatically routes Claude-3 requests to the fastest available model.

Multi-Provider Performance Optimization

rules:
  - id: "gpt4-performance"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "openai/gpt4"
      - target: "azure/gpt4"
      - target: "anthropic/claude-3-opus"

How Latency-Based Routing Works

Performance Monitoring

  1. Data Collection: Gateway tracks response times for each model
  2. Performance Window: Uses the last 20 minutes or the most recent 100 requests (whichever window contains fewer requests)
  3. Minimum Sample Size: Requires at least 3 requests before evaluating performance
  4. Fast Model Threshold: Models within 1.2x of fastest are considered “fast”

Routing Logic

  1. Performance Ranking: Models are ranked by average response time
  2. Fast Model Selection: All models within 1.2x of fastest are eligible
  3. Load Distribution: Traffic is distributed across fast models
  4. Dynamic Updates: Routing adjusts as performance changes
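
As a rough illustration of these four steps (again, not the gateway's code), the sketch below ranks models, applies the 1.2x threshold, and picks a target. It assumes traffic is spread uniformly across the fast set; the actual distribution policy among fast models is not specified here.

import random

FAST_THRESHOLD = 1.2  # models within 1.2x of the fastest count as "fast"

def pick_target(avg_latency_ms):
    """Rank models by average latency and pick one of the 'fast' ones."""
    ranked = sorted((ms, model) for model, ms in avg_latency_ms.items()
                    if ms is not None)   # exclude models still warming up
    if not ranked:
        return None                      # no data yet: caller falls back to random routing
    fastest_ms = ranked[0][0]
    fast = [model for ms, model in ranked
            if ms <= fastest_ms * FAST_THRESHOLD]
    return random.choice(fast)           # distribute traffic across all fast models

# Reproduces the scenario below: only Model A (500 ms) and Model B
# (550 ms, 1.1x) are eligible; Models C and D exceed the 1.2x threshold.
averages = {"Model A": 500, "Model B": 550, "Model C": 650, "Model D": 700}
assert pick_target(averages) in {"Model A", "Model B"}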

Example Performance Scenario

Model A: 500ms average (fastest)
Model B: 550ms average (1.1x - considered fast)
Model C: 650ms average (1.3x - considered slow)
Model D: 700ms average (1.4x - considered slow)

Result: Traffic distributed between Model A and Model B only

Performance Considerations

Warm-up Period

  • New models need at least 3 requests before being evaluated
  • During warm-up, requests may be distributed randomly
  • Performance data builds up over time

Performance Stability

  • Performance windows (20 min/100 requests) provide stability
  • Prevents rapid switching between models
  • Allows for performance trend analysis

Load Distribution

  • Multiple fast models share traffic load
  • Prevents overloading single fastest model
  • Provides redundancy and fault tolerance

Best Practices

  1. Model Selection: Include models with similar capabilities for fair comparison
  2. Monitoring: Track performance metrics and routing decisions
  3. Capacity Planning: Ensure all models can handle expected load
  4. Testing: Test with realistic workloads to understand performance patterns
  5. Fallback Strategy: Configure fallback candidates for robust error handling

Use Cases

  • Performance Optimization: Route to fastest models for time-sensitive applications
  • Multi-Provider Optimization: Compare performance across different AI providers
  • Geographic Optimization: Route to closest/fastest data centers
  • Load Balancing: Distribute traffic across multiple fast models
  • Cost-Performance Balance: Use performance data to optimize cost vs. speed
  • A/B Testing: Compare performance of different model configurations