TrueFoundry AI Gateway lets you apply routing policies at the gateway layer to enable load balancing, fallbacks, and retries across models. Here are a few common scenarios in which organizations need load balancing when working with LLMs in production.

Why do we need load balancing, fallbacks, and retries?

Model providers experience outages and downtime. For example, here are screenshots of OpenAI's and Anthropic's status pages from February to May 2025.
OpenAI Status Page

Anthropic Status Page

To avoid application downtime when models go down, many organizations use multiple model providers and configure load balancing to route to a healthy model if one goes down, thereby keeping their applications available for their users.
Latency and performance vary with time, region, model, and provider. Here's a graph of the latency variance of a few models over the course of a month.
Latency variance of models over the course of a month

We want to be able to route dynamically to the model with the lowest latency at any point in time.
Many LLM providers enforce strict rate limits on API usage. Here's a screenshot of Azure OpenAI's rate limits:
Azure OpenAI Rate Limits (TPM and RPM quotas per model)

When these limits are exceeded, requests begin to fail and we want to be able to route to other models to keep our application running.
Testing new models or updates in production carries significant risk. Dynamic load balancing can route a small percentage of traffic to a new model so you can monitor its performance before shifting all traffic to it.

How TrueFoundry AI Gateway solves routing challenges

TrueFoundry AI Gateway lets us configure load-balancing and fallback rules to route each request to the most suitable model as defined by the user. The rules are defined as part of a routing policy, in which different rules can apply to different subsets of requests. A few examples:
  • When a request comes for the model gpt-4o, route 90% of the requests to azure/gpt-4o and 10% to openai/gpt-4o.
  • When a request comes for the model claude-3-opus, route 100% of the requests to anthropic/claude-3-opus, and if there is a failure, fall back to anthropic/claude-3-sonnet.
A routing configuration is a list of rules. Each rule defines the subset of requests to which it applies and the list of target models to which those requests can be routed. The routing strategy can be weight-based, latency-based, or priority-based, as described below:

Weight-based Routing

In weight-based routing, the user assigns a weight to each target model. The gateway distributes incoming requests across the models in proportion to their assigned weights. For example, when a request for the model gpt-4o is received, you can configure 90% of the requests to go to azure/gpt-4o and 10% to openai/gpt-4o.
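As a rough sketch of such a rule (the field names here are illustrative assumptions, not the exact schema; see the Configuration page for the authoritative format), a weight-based rule could look like this:

```yaml
# Illustrative sketch only - field names are assumptions, not the exact schema.
name: weight-based-routing
type: gateway-load-balancing-config
rules:
  - id: gpt4o-split
    when:
      models: ["gpt-4o"]      # subset of requests this rule applies to
    load_balance_targets:
      - target: azure/gpt-4o
        weight: 90            # ~90% of matching requests
      - target: openai/gpt-4o
        weight: 10            # ~10% of matching requests
```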

Latency-based Routing

In latency-based routing, the user doesn't need to define weights for the target models. The gateway automatically chooses the model with the lowest latency, determined by the following algorithm (a configuration sketch follows the list):
  1. The time per output token (inter-token latency) is used as the latency metric for evaluating the models.
  2. To calculate latency, only requests from the last 20 minutes are considered, capped at the most recent 100 requests. If there are fewer than 3 requests in the last 20 minutes, latency is not derived and the model is treated as the fastest, so that more requests are routed to it and enough data accumulates to derive its latency.
  3. Models are considered equally fast if their latency is within 1.2x of the fastest model's; this avoids rapid switching between models due to minor differences in latency.
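As with the weight-based example, this is only an illustrative sketch with assumed field names (see the Configuration page for the exact schema):

```yaml
# Illustrative sketch only - field names are assumptions, not the exact schema.
name: latency-based-routing
type: gateway-load-balancing-config
rules:
  - id: fastest-gpt4o
    when:
      models: ["gpt-4o"]
    # No weights: the gateway picks the lowest-latency target
    # using the inter-token-latency algorithm described above.
    load_balance_targets:
      - target: azure/gpt-4o
      - target: openai/gpt-4o
```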

Priority-based Routing

In priority-based routing, the user defines a priority level for each target model. The gateway routes requests to the highest-priority model that is healthy and available; if that model fails or becomes unavailable, the gateway automatically falls back to the model with the next-highest priority. Lower priority numbers indicate higher priority (0 is the highest).
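Again as an illustrative sketch with assumed field names (see the Configuration page for the exact schema), reusing the claude-3-opus fallback example from above:

```yaml
# Illustrative sketch only - field names are assumptions, not the exact schema.
name: priority-based-routing
type: gateway-load-balancing-config
rules:
  - id: claude-failover
    when:
      models: ["claude-3-opus"]
    load_balance_targets:
      - target: anthropic/claude-3-opus
        priority: 0   # highest priority - used while healthy
      - target: anthropic/claude-3-sonnet
        priority: 1   # used when the priority-0 target is unavailable
```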
The gateway keeps track of the requests per minute, tokens per minute, and failures per minute for each model configured in the gateway. Based on these metrics and the values configured by the user in the routing configuration, the gateway evaluates the best model to route each request to. If that model fails, the gateway falls back to the next-best model in the targets list.
The rules in a routing configuration are evaluated serially for each request, and the first matching rule is applied; subsequent rules are not evaluated.
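For instance, in the hypothetical sketch below (field names, including the metadata matcher, are assumptions), a gpt-4o request tagged env: prod matches the first rule, so the second rule is never evaluated:

```yaml
# Illustrative sketch only - field names are assumptions, not the exact schema.
name: ordered-rules
type: gateway-load-balancing-config
rules:
  # Evaluated first: matches only gpt-4o requests tagged env: prod
  - id: gpt4o-prod
    when:
      models: ["gpt-4o"]
      metadata:
        env: prod
    load_balance_targets:
      - target: azure/gpt-4o
        weight: 100
  # Evaluated only if the rule above does not match
  - id: gpt4o-default
    when:
      models: ["gpt-4o"]
    load_balance_targets:
      - target: openai/gpt-4o
        weight: 100
```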

Retry and Fallback Mechanisms

For each target model in a rule in the routing configuration, we can configure the following retry and fallback options (a combined sketch follows the list):
  • Retry configuration: Define the number of attempts, the delay between retries, and the status codes that trigger a retry. The default status codes to retry on are 429, 500, 502, and 503.
  • Fallback on failure of target: Define the status codes that trigger a fallback to other targets. The default status codes to fall back on are 401, 403, 404, 429, 500, 502, and 503.
  • Fallback candidate: Define whether the target can be used as a fallback candidate, i.e. if another target model fails, whether the request can fall back to this target. The default value is true.
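A sketch of these options on a pair of targets might look like the following (field names are illustrative assumptions; see the Configuration page for the exact schema):

```yaml
# Illustrative sketch only - field names are assumptions, not the exact schema.
load_balance_targets:
  - target: azure/gpt-4o
    weight: 90
    retry:
      attempts: 3
      delay_ms: 500
      on_status_codes: [429, 500, 502, 503]        # default retry codes
    fallback_on_status_codes: [401, 403, 404, 429, 500, 502, 503]
    fallback_candidate: true    # other targets may fall back to this one
  - target: openai/gpt-4o
    weight: 10
    fallback_candidate: true
```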
To understand more on how to write the configuration for weight-based, latency-based, and priority-based routing, check out the Configuration page.

Configure Load Balancing on Gateway

The configuration can be added through the Config tab in the Gateway interface.
Load Balancing Configuration Interface

You can also store the configuration in a YAML file in your Git repository and apply it to the gateway using the tfy apply command. This makes it possible to enforce a PR review process for any changes to the load balancing configuration.