TrueFoundry AI Gateway lets you apply routing policies at the gateway layer to enable load balancing, fallback, and retries across models. Here are a few common scenarios in which organizations need load balancing when working with LLMs in production.

Why do we need load balancing / fallback / retries?

How does TrueFoundry AI Gateway solve routing challenges?

TrueFoundry AI Gateway lets you configure load-balancing and fallback rules that route each request to the most suitable model, as defined by the user. The rules are defined as part of a routing policy, in which different rules can apply to different subsets of requests. A few examples are:
  • When a request comes in for the model gpt-4o, route 90% of the requests to azure/gpt-4o and 10% to openai/gpt-4o.
  • When a request comes in for the model claude-3-opus, route 100% of the requests to anthropic/claude-3-opus and, if there is a failure, fall back to anthropic/claude-3-sonnet.
The structure of a routing configuration looks something like this:
[Image: Routing Config]
Each rule defines the subset of requests to which it applies and the list of target models to which those requests can be routed. The routing strategy can be weight-based, latency-based, or priority-based, as described below:

Weight-based Routing

The gateway keeps track of the requests / minute, tokens / minute, and failures / minute for each model configured in the gateway. Based on these metrics and the values configured by the user in the routing configuration, the gateway determines the best model to which the request can be routed. If this model fails, the gateway falls back to the second-best model in the targets list.
The rules in a routing configuration are evaluated for a request serially and the first matching rule is applied. Subsequent rules are not evaluated.
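The first-match evaluation and a weight-based split can be sketched as follows. This is a minimal illustration, not the gateway's implementation or its configuration schema: the names `Rule`, `Target`, `first_matching_rule`, and `pick_weighted_target` are hypothetical, and a rule is assumed to match on the requested model name only.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Target:
    model: str   # e.g. "azure/gpt-4o"
    weight: int  # relative share of traffic (hypothetical field)


@dataclass
class Rule:
    rule_id: str
    models: set[str]        # requested model names this rule applies to (assumption)
    targets: list[Target]


def first_matching_rule(rules: list[Rule], requested_model: str) -> Rule | None:
    """Walk the rules in order and return the first match; later rules are never checked."""
    for rule in rules:
        if requested_model in rule.models:
            return rule
    return None


def pick_weighted_target(rule: Rule, rng: random.Random) -> Target:
    """Pick one target at random, in proportion to the configured weights."""
    weights = [target.weight for target in rule.targets]
    return rng.choices(rule.targets, weights=weights, k=1)[0]


# Mirrors the 90/10 example above: gpt-4o traffic split across two providers.
EXAMPLE_RULES = [
    Rule(
        rule_id="gpt-4o-split",
        models={"gpt-4o"},
        targets=[Target("azure/gpt-4o", 90), Target("openai/gpt-4o", 10)],
    ),
]
```

Because evaluation stops at the first match, the order of rules matters: a broad rule placed before a narrow one will shadow it.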

Retry and Fallback Mechanisms

For each target model in a rule in the routing configuration, you can configure the following retry and fallback options:
  • Retry Configuration: Define the number of attempts, the delay between retries, and the status codes that trigger retries. The default status codes to retry on are 429, 500, 502, 503.
  • Fallback on failure of target: Define the status codes that trigger fallback to other targets. The default status codes to fall back on are 401, 403, 404, 429, 500, 502, 503.
  • Fallback candidate: Define whether the target can be used as a fallback candidate, i.e. whether a request can fall back to this target when another target model fails. The default value is true.
To learn how to write the configuration for weight-based, latency-based, and priority-based routing, check out the Configuration page.
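The interaction between these three options can be sketched as follows. This is an illustrative model under stated assumptions, not the gateway's code: `TargetPolicy`, `call_with_retries`, and `route` are hypothetical names, `retry_attempts` is assumed to count retries after the initial call, and `send` stands in for whatever issues the request and returns an HTTP status code. Only the default status-code sets are taken from the text above.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable

# Defaults as listed in the documentation above.
DEFAULT_RETRY_CODES = frozenset({429, 500, 502, 503})
DEFAULT_FALLBACK_CODES = frozenset({401, 403, 404, 429, 500, 502, 503})


@dataclass
class TargetPolicy:
    model: str
    retry_attempts: int = 0          # assumption: retries after the first call
    delay_seconds: float = 0.0       # wait between retries
    retry_on: frozenset = DEFAULT_RETRY_CODES
    fallback_on: frozenset = DEFAULT_FALLBACK_CODES
    fallback_candidate: bool = True


def call_with_retries(target: TargetPolicy, send: Callable[[str], int],
                      sleep: Callable[[float], None]) -> int:
    """Call one target, retrying while the status code is retryable."""
    status = send(target.model)
    for _ in range(target.retry_attempts):
        if status not in target.retry_on:
            break
        sleep(target.delay_seconds)
        status = send(target.model)
    return status


def route(targets: list[TargetPolicy], send: Callable[[str], int],
          sleep: Callable[[float], None] = time.sleep) -> tuple[str | None, int | None]:
    """Try targets in order (best first); return (model, status) of the final attempt."""
    last_model, status = None, None
    for index, target in enumerate(targets):
        # Targets after the first are reached only via fallback,
        # so skip any that opted out of being a fallback candidate.
        if index > 0 and not target.fallback_candidate:
            continue
        status = call_with_retries(target, send, sleep)
        last_model = target.model
        if status not in target.fallback_on:
            return last_model, status  # success, or an error that should not fall back
    return last_model, status
```

Note that with the defaults, a 401 is not retried on the same target but does trigger fallback, while a 429 is first retried and then falls back if it persists.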

Configure Load Balancing on Gateway

The configuration can be added through the Config tab in the Gateway interface.
[Image: Load Balancing Configuration Interface, showing the Config tab's YAML editor for load balancing configuration]

You can also store the configuration in a YAML file in your Git repository and apply it to the Gateway using the tfy apply command. This lets you enforce a PR review process for any changes to the load balancing configuration.