TrueFoundry AI Gateway provides a load balancing feature that allows you to distribute requests across multiple models. Here are a few common scenarios in which organizations need load balancing when working with LLMs in production.

Why do we need load balancing?

1. Service Outages and Downtime: Model providers experience outages and downtime. For example, here are screenshots of OpenAI's and Anthropic's status pages from February to May 2025.
OpenAI Status Page: multiple incidents and outages from February to May 2025

Anthropic Status Page: service disruptions and degraded performance incidents from February to May 2025

To avoid application downtime when a model goes down, many organizations use multiple model providers and configure load balancing to route requests to a healthy model if one of them fails, keeping their applications available to users.

2. Latency Variance among Models: Latency and performance vary by time, region, model, and provider. Here's a graph of the latency variance of a few models over the course of a month.
Latency variance of models over the course of a month, with significant fluctuations between providers

We want to be able to route dynamically to the model with the lowest latency at any point in time.

3. Rate Limits of Models: Most LLM providers enforce strict rate limits on API usage. Here's a screenshot of Azure OpenAI's rate limits:
Azure OpenAI Rate Limits: TPM (tokens per minute) and RPM (requests per minute) quotas for different models

When these limits are exceeded, requests begin to fail, and we want to be able to route to other models to keep our application running.

4. Canary Testing: Testing new models or updates in production carries significant risk. Dynamic load balancing can be used to route a small percentage of traffic to the new model and monitor its performance before routing all the traffic to it.

How TrueFoundry AI Gateway solves these challenges

TrueFoundry AI Gateway lets you configure load-balancing rules that route each request to the most suitable model, as defined by the user. It can automatically route to the model with the lowest latency, the model with the fewest failures, or based on priority levels. This can provide much higher availability and performance for your application.

How is load balancing done in TrueFoundry AI Gateway?

The gateway keeps track of the requests per minute, tokens per minute, and failures per minute for each model configured in the gateway. Based on these metrics and the thresholds set by the user in the load balancing configuration, the gateway first evaluates the list of healthy models to which a request can be routed. Among the healthy models, it then chooses one based on the load balancing strategy configured by the user, which can be weight-based, latency-based, or priority-based.

Weight-based Routing

In weight-based routing, the user assigns a weight to each target model, and the gateway distributes incoming requests to the models in proportion to their assigned weights.
  • For example, if a request for the model gpt-4o is received:
    • You can configure 90% of the requests to go to azure/gpt-4o and 10% to openai/gpt-4o.
    • The gateway will automatically route requests according to these ratios, as sketched in the example below.
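As a rough illustration, a weight-based rule might look like the following sketch. The field names (rules, when, load_balance_targets, weight) and target identifiers are assumptions made for illustration, not the authoritative schema; see the Configuration page for the exact format.

```yaml
# Illustrative sketch only - field names are assumptions, not the exact gateway schema.
# Split traffic for gpt-4o across two providers in a 90/10 ratio.
name: weight-based-routing-example
type: gateway-load-balancing-config
rules:
  - id: gpt-4o-weighted
    when:
      models:
        - gpt-4o                  # incoming model name to match
    load_balance_targets:
      - target: azure/gpt-4o      # receives ~90% of matched requests
        weight: 90
      - target: openai/gpt-4o     # receives ~10% of matched requests
        weight: 10
```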

Latency-based Routing

With latency-based routing, the user doesn't need to define weights for the target models. The gateway automatically chooses the model with the lowest latency, which is determined by the following algorithm:
  1. The time per output token is used as the latency metric for evaluating models.
  2. Only requests from the last 20 minutes are considered when calculating latency. If there are more than 100 requests in that window, only the most recent 100 are used; otherwise, all of them are used. If there are fewer than 3 requests in the last 20 minutes, latency is not derived and the model is treated as the fastest model, so that more requests are routed to it and enough data is gathered to derive its latency.
  3. Models are considered equally fast if their latency is within 1.2X of the fastest model's; this avoids rapid switching between models due to minor differences in latency.
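A latency-based rule could be sketched as follows. The strategy and load_balance_targets field names are illustrative assumptions rather than the exact gateway schema.

```yaml
# Illustrative sketch only - field names are assumptions, not the exact gateway schema.
# No weights needed: the gateway routes to whichever target currently has the
# lowest time per output token, per the algorithm described above.
name: latency-based-routing-example
type: gateway-load-balancing-config
rules:
  - id: gpt-4o-lowest-latency
    strategy: latency-based
    when:
      models:
        - gpt-4o
    load_balance_targets:
      - target: azure/gpt-4o
      - target: openai/gpt-4o
```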

Priority-based Routing

In priority-based routing, the user defines a priority level for each target model. The gateway routes requests to the highest-priority model (the one with the lowest priority number) that is healthy and available. If that model fails or becomes unavailable, the gateway automatically falls back to the next-highest-priority model. Lower priority numbers indicate higher priority (0 is the highest priority).
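A priority-based rule might be sketched like this; the priority field and the other names are assumptions made for illustration.

```yaml
# Illustrative sketch only - field names are assumptions, not the exact gateway schema.
# The priority 0 target is preferred; priority 1 is used only when it is unhealthy.
name: priority-based-routing-example
type: gateway-load-balancing-config
rules:
  - id: gpt-4o-priority
    strategy: priority-based
    when:
      models:
        - gpt-4o
    load_balance_targets:
      - target: azure/gpt-4o      # priority 0 = highest priority, used first
        priority: 0
      - target: openai/gpt-4o     # fallback when the primary target is unhealthy
        priority: 1
```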

Retry and Fallback Mechanisms

The gateway supports configurable retry mechanisms for each target:
  • Retry Configuration: Define the number of attempts, delay between retries, and status codes that trigger retries
  • Fallback Configuration: Define status codes that trigger fallback to other targets and specify which targets can be used as fallback candidates
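As a hedged sketch, retry and fallback settings might attach to a target like this; the retry, fallback, attempts, delay_ms, and on_status_codes field names are assumptions made for illustration, not the exact gateway schema.

```yaml
# Illustrative sketch only - field names are assumptions, not the exact gateway schema.
rules:
  - id: gpt-4o-with-retries
    when:
      models:
        - gpt-4o
    load_balance_targets:
      - target: azure/gpt-4o
        retry:
          attempts: 3                      # retry a failed request up to 3 times
          delay_ms: 200                    # wait between attempts
          on_status_codes: [429, 500, 503] # responses that trigger a retry
        fallback:
          on_status_codes: [429, 500, 503] # responses that trigger a fallback
          targets:
            - openai/gpt-4o                # fallback candidate for this target
      - target: openai/gpt-4o
```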
To learn more about how to write the configuration for weight-based, latency-based, and priority-based routing, check out the Configuration page.

Configure Load Balancing on Gateway

The configuration can be added through the Config tab in the Gateway interface.
Load Balancing Configuration Interface: the Gateway Config tab with a YAML editor

You can also store the configuration in a YAML file in your Git repository and apply it to the Gateway using the tfy apply command. This lets you enforce a PR review process for any changes to the load balancing configuration.

Commonly Used Load Balancing Configurations

Here are a few examples of load balancing configurations for different use cases.