Latency-based routing automatically monitors the performance of multiple models and routes requests to the fastest available models. This strategy is ideal for optimizing response times and ensuring the best user experience.

Overview

Latency-based routing provides:
  • Automatic performance optimization: Routes to fastest models without manual intervention
  • Dynamic adaptation: Adjusts routing based on real-time performance data
  • Load distribution: Spreads traffic across multiple fast models
  • Performance monitoring: Continuously tracks model response times

Configuration Structure

# Configuration Details
name: string                        # Required: Configuration name (e.g. "latency-based-config")
type: gateway-load-balancing-config

# Rules
rules:
  - id: string                      # Required: Unique identifier for the rule
    type: "latency-based-routing"   # Required: Must be "latency-based-routing"
    when:                           # Required: Conditions for when to apply this rule
      subjects: string[]            # Optional: List of user/virtual account identifiers
      models: string[]              # Required: List of model names to match
      metadata: object              # Optional: Additional matching criteria
    load_balance_targets:           # Required: List of models to route to
      - target: string              # Required: Model identifier
        override_params: object     # Optional: Model-specific parameters to override

Key Requirements

Latency Configuration

  • No weight values needed - system automatically determines routing
  • No priority values needed - performance determines routing order
  • System automatically monitors and routes to fastest models
  • Models are considered “fast” if their response time is within 1.2x of the fastest model
  • System needs at least 3 requests to start evaluating a model’s performance
  • Evaluates over the last 20 minutes or the most recent 100 requests, whichever window contains fewer requests (see the sketch below)
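
To make these window rules concrete, here is a minimal sketch; it is not the gateway's actual implementation, and the sample format and function names are assumptions. It shows how the 20-minute window, the 100-request window, and the 3-request warm-up check could combine.

import time

WINDOW_SECONDS = 20 * 60   # last 20 minutes
WINDOW_REQUESTS = 100      # most recent 100 requests
MIN_SAMPLES = 3            # requests needed before a model is evaluated

def average_latency(samples, now=None):
    """Average latency over whichever window holds fewer requests,
    or None while the model is still warming up (< 3 samples).
    `samples` is a time-ordered list of (timestamp, latency_ms) pairs."""
    now = time.time() if now is None else now
    recent = samples[-WINDOW_REQUESTS:]  # most recent 100 requests...
    in_window = [ms for ts, ms in recent
                 if now - ts <= WINDOW_SECONDS]  # ...that are under 20 minutes old
    if len(in_window) < MIN_SAMPLES:
        return None
    return sum(in_window) / len(in_window)

# e.g. three samples taken just now average to 500 ms:
now = time.time()
assert average_latency([(now, 450), (now, 500), (now, 550)], now=now) == 500.0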

Example Configurations
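
Basic Latency-Based Routing

The basic example below is a reconstruction that follows the configuration structure above; the rule id and the two target names are illustrative assumptions, not canonical identifiers.

rules:
  - id: "claude3-performance"
    type: "latency-based-routing"
    when:
      models:
        - "claude-3"
    load_balance_targets:
      - target: "anthropic/claude-3"
      - target: "aws-bedrock/claude-3"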

This configuration automatically routes Claude-3 requests to the fastest available model.

Multi-Provider Performance Optimization

rules:
  - id: "gpt4-performance"
    type: "latency-based-routing"
    when:
      models:
        - "gpt-4"
    load_balance_targets:
      - target: "openai/gpt4"
      - target: "azure/gpt4"
      - target: "anthropic/claude-3-opus"

How Latency-Based Routing Works

Performance Monitoring

  1. Data Collection: Gateway tracks response times for each model
  2. Performance Window: Uses the last 20 minutes or the most recent 100 requests (whichever window contains fewer requests)
  3. Minimum Sample Size: Requires at least 3 requests before evaluating performance
  4. Fast Model Threshold: Models within 1.2x of fastest are considered “fast”

Routing Logic

  1. Performance Ranking: Models are ranked by average response time
  2. Fast Model Selection: All models within 1.2x of fastest are eligible
  3. Load Distribution: Traffic is distributed across fast models
  4. Dynamic Updates: Routing adjusts as performance changes
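
As a rough illustration of these four steps (again, not the gateway's code), the sketch below ranks models, applies the 1.2x threshold, and picks a target. It assumes traffic is spread uniformly across the fast set; the actual distribution policy among fast models is not specified here.

import random

FAST_THRESHOLD = 1.2  # models within 1.2x of the fastest count as "fast"

def pick_target(avg_latency_ms):
    """Rank models by average latency and pick one of the 'fast' ones."""
    ranked = sorted((ms, model) for model, ms in avg_latency_ms.items()
                    if ms is not None)   # exclude models still warming up
    if not ranked:
        return None                      # no data yet: caller falls back to random routing
    fastest_ms = ranked[0][0]
    fast = [model for ms, model in ranked
            if ms <= fastest_ms * FAST_THRESHOLD]
    return random.choice(fast)           # distribute traffic across all fast models

# Reproduces the scenario below: only Model A (500 ms) and Model B
# (550 ms, 1.1x) are eligible; Models C and D exceed the 1.2x threshold.
averages = {"Model A": 500, "Model B": 550, "Model C": 650, "Model D": 700}
assert pick_target(averages) in {"Model A", "Model B"}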

Example Performance Scenario

Model A: 500ms average (fastest)
Model B: 550ms average (1.1x - considered fast)
Model C: 650ms average (1.3x - considered slow)
Model D: 700ms average (1.4x - considered slow)

Result: Traffic distributed between Model A and Model B only

Performance Considerations

Warm-up Period

  • New models need at least 3 requests before being evaluated
  • During warm-up, requests may be distributed randomly
  • Performance data builds up over time

Performance Stability

  • Performance windows (20 min/100 requests) provide stability
  • Prevents rapid switching between models
  • Allows for performance trend analysis

Load Distribution

  • Multiple fast models share traffic load
  • Prevents overloading single fastest model
  • Provides redundancy and fault tolerance

Best Practices

  1. Model Selection: Include models with similar capabilities for fair comparison
  2. Monitoring: Track performance metrics and routing decisions
  3. Capacity Planning: Ensure all models can handle expected load
  4. Testing: Test with realistic workloads to understand performance patterns
  5. Fallback Strategy: Configure fallback candidates for robust error handling

Use Cases

  • Performance Optimization: Route to fastest models for time-sensitive applications
  • Multi-Provider Optimization: Compare performance across different AI providers
  • Geographic Optimization: Route to closest/fastest data centers
  • Load Balancing: Distribute traffic across multiple fast models
  • Cost-Performance Balance: Use performance data to optimize cost vs. speed
  • A/B Testing: Compare performance of different model configurations