Truefoundry AI Gateway provides a load balancing feature that allows you to distribute requests across multiple models. Here are a few common scenarios where organizations need load balancing while working with LLMs in production.

Why do we need load balancing?

1. Service Outages and Downtime: Model providers experience outages and downtime. For example, here are screenshots of OpenAI's and Anthropic's status pages from February to May 2025.

OpenAI Status Page

Anthropic Status Page

To keep their applications available when a model goes down, many organizations use multiple model providers and configure load balancing to route requests to a healthy model whenever one of the models fails, thus avoiding any downtime for their users.

2. Latency Variance Among Models: Latency and performance vary with time, region, model, and provider. Here's a graph of the latency variance of a few models over the course of a month.

Latency variance of models over the course of a month

We want to be able to route dynamically to the model with the lowest latency at any point in time.

3. Rate Limits of Models: Many LLM providers enforce strict rate limits on API usage. Here's a screenshot of Azure OpenAI's rate limits:

Azure OpenAI Rate Limits

When these limits are exceeded, requests begin to fail, and we want to be able to route to other models to keep our application running.

4. Canary Testing: Testing new models or updates in production carries significant risk. Dynamic load balancing can be used to route a small percentage of traffic to a new model and monitor its performance before routing all traffic to it.

How Truefoundry AI Gateway solves these challenges

Truefoundry AI Gateway lets you configure load-balancing rules that route each request to the most suitable model as defined by the user. It can automatically route to the model with the lowest latency or with the fewest failures, which can provide much higher availability and performance for your application.

How is load balancing done in Truefoundry AI Gateway?

The gateway keeps track of the requests per minute, tokens per minute, and failures per minute for each model configured in the gateway. Based on these metrics and the values configured by the user in the load balancing configuration, the gateway first evaluates the list of healthy models to which a request can be routed. Among the healthy models, it then chooses one based on the load balancing strategy configured by the user, which can be either weight based or latency based.
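
Conceptually, this health evaluation is a filter over the per-model counters. The sketch below is only an illustration of that idea, not TrueFoundry's implementation; the class names, field names, and the notion of explicit per-model limits are hypothetical stand-ins for whatever thresholds the user configures.

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    """Rolling per-minute counters the gateway tracks for each target model."""
    requests_per_min: int
    tokens_per_min: int
    failures_per_min: int

@dataclass
class ModelLimits:
    """Hypothetical user-configured limits for a target model."""
    max_requests_per_min: int
    max_tokens_per_min: int
    max_failures_per_min: int

def healthy_models(stats: dict, limits: dict) -> list:
    """Return the targets whose current usage stays within the configured limits."""
    healthy = []
    for model, s in stats.items():
        l = limits[model]
        if (s.requests_per_min < l.max_requests_per_min
                and s.tokens_per_min < l.max_tokens_per_min
                and s.failures_per_min < l.max_failures_per_min):
            healthy.append(model)
    return healthy

# The gateway then applies the configured strategy (weight based or
# latency based) to pick one model from this healthy list.
```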

In weight-based routing, the user defines a weight for each target model, and the gateway routes requests in the ratio of those weights. For example, the user can specify that when a request with model name gpt-4o is received, 90% of the requests should go to azure/gpt-4o and 10% to openai/gpt-4o.
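
Conceptually, weight-based routing amounts to a weighted random choice over the healthy targets. Here's a minimal sketch using Python's standard library with the 90/10 split from the example above; how the gateway implements the split internally is not specified here.

```python
import random

# Example targets and weights from the gpt-4o scenario above.
targets = ["azure/gpt-4o", "openai/gpt-4o"]
weights = [90, 10]

def pick_target() -> str:
    """Pick a target so that, over many requests, traffic follows the 90/10 ratio."""
    return random.choices(targets, weights=weights, k=1)[0]

# Quick check: over 10,000 simulated requests the split should be close to 90/10.
counts = {t: 0 for t in targets}
for _ in range(10_000):
    counts[pick_target()] += 1
print(counts)
```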

In latency-based routing, the user doesn't need to define weights for the target models. The gateway automatically chooses the model with the lowest latency, based on the following algorithm (a code sketch follows the list):

  1. The time per output token is used as the latency metric for each model.
  2. To calculate latency, only the requests in the last 20 minutes are considered, capped at the most recent 100 requests. If there are fewer than 3 requests in the last 20 minutes, the latency is not derived and the model is treated as the fastest model, so that more requests are routed to it and enough data is gathered to derive its latency.
  3. Models are considered equally fast if their latency is within 1.2x of the fastest model - this is done to avoid rapid switching of models due to minor differences in latency.
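
Putting these rules together, the selection logic can be sketched as below. This is an illustrative re-implementation of the algorithm described above, not the gateway's actual code; the function names and data structures are assumptions, while the 20-minute window, 100-request cap, 3-request minimum, and 1.2x tolerance come from the description.

```python
import time

WINDOW_SECONDS = 20 * 60   # only requests from the last 20 minutes are considered
MAX_SAMPLES = 100          # within that window, at most the last 100 requests
MIN_SAMPLES = 3            # below this, the model is treated as the fastest
TOLERANCE = 1.2            # within 1.2x of the fastest counts as equally fast

def average_latency(samples, now=None):
    """samples is a list of (timestamp, seconds_per_output_token) tuples for one model.
    Returns the average time per output token over the window, or None if there
    are fewer than MIN_SAMPLES recent requests."""
    now = now if now is not None else time.time()
    recent = [lat for ts, lat in samples if now - ts <= WINDOW_SECONDS]
    recent = recent[-MAX_SAMPLES:]              # keep only the most recent 100
    if len(recent) < MIN_SAMPLES:
        return None
    return sum(recent) / len(recent)

def fastest_models(history):
    """history maps model name -> list of (timestamp, seconds_per_output_token).
    Returns the models considered fastest: those with too little data (so they
    receive traffic and accumulate samples), plus those within TOLERANCE of the
    lowest measured latency (to avoid rapid switching on minor differences)."""
    latencies = {model: average_latency(samples) for model, samples in history.items()}
    unknown = [m for m, lat in latencies.items() if lat is None]
    measured = {m: lat for m, lat in latencies.items() if lat is not None}
    if not measured:
        return unknown
    best = min(measured.values())
    return unknown + [m for m, lat in measured.items() if lat <= best * TOLERANCE]
```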

To learn how to write the configuration for weight-based and latency-based routing, check out the Configuration page.

Configure Load Balancing on Gateway

The configuration can be added through the Config tab in the Gateway interface.

You can also store the configuration in a YAML file in your Git repository and apply it to the Gateway using the tfy apply command. This lets you enforce a PR review process for any changes to the load balancing configuration.