Optimize LLM performance and reliability through intelligent request distribution
Several factors motivate distributing requests across providers:

- OpenAI Status Page
- Anthropic Status Page
- Latency variance of models over the course of a month
- Azure OpenAI Rate Limits
When a request for `gpt-4o` is received, the gateway routes 90% of the requests to `azure/gpt-4o` and 10% of the requests to `openai/gpt-4o`.
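The weighted split above can be sketched as a simple weighted random choice. This is a minimal illustration of the idea, not the gateway's actual implementation; the target names mirror the example:

```python
import random

# Targets and weights from the example rule:
# 90% of traffic to azure/gpt-4o, 10% to openai/gpt-4o.
TARGETS = ["azure/gpt-4o", "openai/gpt-4o"]
WEIGHTS = [90, 10]

def pick_target(rng: random.Random) -> str:
    """Pick a target model according to the configured weights."""
    return rng.choices(TARGETS, weights=WEIGHTS, k=1)[0]

# Over many requests the observed split converges to roughly 90/10.
rng = random.Random(0)
counts = {t: 0 for t in TARGETS}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
```

Because the choice is random per request, the 90/10 split holds in aggregate rather than for any individual request.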
With latency-based routing, the user does not need to define weights for each of the target models. The gateway automatically chooses the model with the lowest latency, selected based on the following algorithm:
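This excerpt does not include the gateway's actual selection algorithm, but a common approach to lowest-latency routing is to keep a smoothed (exponentially weighted moving average) latency per model and route to the minimum. The sketch below is an illustrative assumption, not the gateway's implementation:

```python
class LatencyRouter:
    """Route to the model with the lowest smoothed latency.

    Illustrative sketch only (assumed approach, not the gateway's actual
    algorithm): tracks an exponentially weighted moving average (EWMA)
    of observed latencies per model and picks the minimum.
    """

    def __init__(self, models, alpha=0.3):
        self.alpha = alpha                    # EWMA smoothing factor
        self.ewma = {m: None for m in models} # smoothed latency per model

    def record(self, model, latency_ms):
        """Update the smoothed latency after a response is observed."""
        prev = self.ewma[model]
        if prev is None:
            self.ewma[model] = latency_ms
        else:
            self.ewma[model] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        """Choose the next target model."""
        # Probe models with no data yet so every model gets measured.
        unseen = [m for m, v in self.ewma.items() if v is None]
        if unseen:
            return unseen[0]
        return min(self.ewma, key=self.ewma.get)
```

In use, `pick()` selects the target before each request and `record()` feeds back the observed latency after each response, so the router adapts as provider latencies drift.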
The load balancing configuration can be applied with the `tfy apply` command. This makes it possible to enforce a PR review process for any changes to the load balancing configuration.