Routing Policy Configuration for TrueFoundry AI Gateway
- `type`: Set to `gateway-load-balancing-config`. It helps the TrueFoundry platform identify that this is a load balancing configuration file.
- `name`: A name for the configuration, e.g. `<companyname>-gateway-load-balancing-config`. The name is only used for logging purposes and doesn't have any other significance.
- `when`: The `subjects`, `models`, and `metadata` fields are combined in an AND fashion, meaning that the rule matches only if all the conditions are met. If an incoming request doesn't match the `when` block in one rule, the next rule is evaluated.
  - `subjects`: Filter based on the list of users, teams, or virtual accounts calling the model. A user can be specified as `user:john-doe`, a team as `team:engineering-team`, or a virtual account as `virtualaccount:acct_1234567890`.
  - `models`: The rule matches if the model name in the request matches any of the models in the list.
  - `metadata`: The rule matches if the metadata in the request matches the metadata in the rule. For example, if we specify `metadata: {environment: "production"}`, the rule will only match if the request has the metadata key `environment` with value `production` in the request header `X-TFY-METADATA`.
- `type`: The load balancing strategy: `weight-based-routing`, `latency-based-routing`, or `priority-based-routing`, depending on which strategy to use. To understand how these strategies work, check out how the gateway does load balancing.
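Putting these fields together, the sketch below shows how a rule's `when` block and strategy `type` might look. The `rules` and `id` field names and the model name are illustrative assumptions; `type`, `name`, `subjects`, `models`, and `metadata` are the fields described above.

```yaml
# A minimal sketch of a routing config, assuming the config holds a list
# of rules; `rules` and `id` are assumed field names.
name: mycompany-gateway-load-balancing-config
type: gateway-load-balancing-config
rules:
  - id: prod-gpt4-weighted           # assumed: identifier used in logs
    type: weight-based-routing       # or latency-based-routing / priority-based-routing
    when:
      subjects: ["team:engineering-team", "user:john-doe"]
      models: ["gpt4"]
      metadata:
        environment: production      # matched against the X-TFY-METADATA request header
    # the targets for this rule are described in the next section
```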
Models to route traffic to (`load_balance_targets`)
This defines the list of models that are eligible to serve requests for this rule. For each target, we can configure the following options:
- `attempts`: Number of retry attempts. The default value is 2.
- `delay`: Delay between retries in milliseconds (default: 100 ms).
- `on_status_codes`: List of HTTP status codes that should trigger a retry (default: `["429", "500", "502", "503"]`).
- `fallback_status_codes`: List of HTTP status codes that trigger fallback to other targets (default: `["401", "403", "404", "429", "500", "502", "503"]`).
- `fallback_candidate`: Boolean indicating whether this target can be used as a fallback option for other targets. The default value is `true`, meaning this target can be used as a fallback option for other targets.

Route all requests to azure/gpt4 by default. If azure/gpt4 is rate limited, route to openai/gpt4, and if that is also rate limited, route to anthropic/claude-3-opus.
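A sketch of this fallback chain using `priority-based-routing` and the per-target options above; `rules`, `id`, `load_balance_targets`, `target`, and `priority` are assumed field names, and lower `priority` values are assumed to be tried first:

```yaml
rules:
  - id: gpt4-priority-fallback
    type: priority-based-routing
    when:
      models: ["gpt4"]
    load_balance_targets:
      - target: azure/gpt4
        priority: 0                        # tried first
        fallback_status_codes: ["429"]     # rate limited -> fall back
      - target: openai/gpt4
        priority: 1
        fallback_status_codes: ["429"]
      - target: anthropic/claude-3-opus
        priority: 2
        fallback_candidate: true           # the default; last-resort target
```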
Canary Deployment to gradually roll out a new gpt-4 model version.
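A sketch of a canary rollout with `weight-based-routing`; the `gpt4-new` model name is a hypothetical placeholder for the new version:

```yaml
rules:
  - id: gpt4-canary
    type: weight-based-routing
    when:
      models: ["gpt4"]
    load_balance_targets:
      - target: openai/gpt4          # current version keeps 90% of traffic
        weight: 90
      - target: openai/gpt4-new      # hypothetical new version gets 10%
        weight: 10
```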
Route all requests to the on-prem hosted Llama model by default. If a sudden burst of traffic exceeds the rate limits of the on-prem model, fall back to the Bedrock Llama model. If the Bedrock model fails, retry it twice.
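One way to sketch this, assuming the hypothetical provider/model names `onprem/llama-3` and `bedrock/llama-3`; the per-target `attempts` option is placed directly on the target here, though the exact nesting may differ:

```yaml
rules:
  - id: llama-onprem-first
    type: priority-based-routing
    when:
      models: ["llama-3"]
    load_balance_targets:
      - target: onprem/llama-3
        priority: 0
        fallback_status_codes: ["429"]   # traffic burst beyond rate limits
      - target: bedrock/llama-3
        priority: 1
        attempts: 2                      # retry twice if the Bedrock call fails
```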
Route to the fastest model between azure/gpt4 and openai/gpt4 at any point. If one of them is down, retry once and then fall back to the other model.
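A sketch with `latency-based-routing` and one retry per target, using the field names assumed in the earlier examples:

```yaml
rules:
  - id: fastest-gpt4
    type: latency-based-routing
    when:
      models: ["gpt4"]
    load_balance_targets:
      - target: azure/gpt4
        attempts: 1                  # retry once before falling back
        fallback_candidate: true     # each model is the other's fallback
      - target: openai/gpt4
        attempts: 1
        fallback_candidate: true
```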
If a user is using gpt-4 in the dev environment, route to the dev account. If a user is using gpt-4 in the prod environment, route to the fastest model between azure/gpt4 and openai/gpt4.
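This can be sketched as two rules matched on the `environment` metadata key (sent via `X-TFY-METADATA`); `dev-account/gpt4` is a hypothetical dev-account model name:

```yaml
rules:
  - id: gpt4-dev
    type: weight-based-routing
    when:
      models: ["gpt4"]
      metadata:
        environment: dev
    load_balance_targets:
      - target: dev-account/gpt4     # hypothetical dev account model
        weight: 100
  - id: gpt4-prod
    type: latency-based-routing
    when:
      models: ["gpt4"]
      metadata:
        environment: prod
    load_balance_targets:
      - target: azure/gpt4
      - target: openai/gpt4
```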
Route the user to the model in the closest region. If the closest region's model is down or slow, route to a model in another region.
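Since latency-based routing naturally favors the nearest (fastest) deployment, a multi-region setup can be sketched as below; the per-region account names are hypothetical:

```yaml
rules:
  - id: closest-region-gpt4
    type: latency-based-routing      # the nearest region usually wins on latency
    when:
      models: ["gpt4"]
    load_balance_targets:
      - target: azure-eastus/gpt4        # hypothetical per-region accounts
        fallback_candidate: true
      - target: azure-westeurope/gpt4
        fallback_candidate: true
      - target: azure-southeastasia/gpt4
        fallback_candidate: true
```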
Route traffic for paid customers to the gpt5 model. For non-paid customers, use the gpt4 model.
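Assuming the customer tier is passed as a metadata key (here the hypothetical key `tier`, sent via `X-TFY-METADATA`), this splits into two rules:

```yaml
rules:
  - id: paid-customers
    type: weight-based-routing
    when:
      metadata:
        tier: paid                   # hypothetical metadata key
    load_balance_targets:
      - target: openai/gpt5
        weight: 100
  - id: free-customers
    type: weight-based-routing
    when:
      metadata:
        tier: free
    load_balance_targets:
      - target: openai/gpt4
        weight: 100
```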
Use the fastest model between openai/gpt4 and azure/gpt4 by default. If there is a failure on openai/gpt4, retry 2 times. Retry once for azure/gpt4. If both of them are unhealthy because of rate limiting or any other errors, fall back to bedrock/claude. When calling the Claude model, change the temperature to 0.5. Apply this only in production for an application called booking-app.
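A sketch combining all of the above; `override_params` is an assumed field name for per-target request overrides such as temperature, and the metadata keys `environment` and `application` are illustrative:

```yaml
rules:
  - id: booking-app-prod-gpt4
    type: latency-based-routing
    when:
      models: ["gpt4"]
      metadata:
        environment: production      # only applies in production
        application: booking-app     # hypothetical metadata key
    load_balance_targets:
      - target: openai/gpt4
        attempts: 2                  # retry twice on failure
      - target: azure/gpt4
        attempts: 1                  # retry once
      - target: bedrock/claude
        fallback_candidate: true     # assumed: serves only as a fallback
        override_params:             # assumed field for request overrides
          temperature: 0.5
```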