The AI Gateway acts as a proxy between your application and the LLM provider. It sits in the critical path of production traffic and hence needs to be highly available and performant. The TrueFoundry AI Gateway has been designed to be a low-latency, high-throughput AI gateway. The gateway is written using the Hono framework, which is ultra-fast, minimalistic, and designed for the edge. The architecture of the gateway is based on the following principles:
  1. There are no external calls in the path of a request from the client to the LLM provider (unless caching is enabled).
  2. All checks for rate-limits, load-balancing, authentication, and authorization are done in memory.
  3. The logs and metrics are written to a queue in an async manner.
  4. Gateway never fails a request even if the external queue is down.
  5. The gateway is designed to be horizontally scalable and CPU-bound so that it scales nicely as the number of requests increases.
  6. It has a control-plane and proxy separation so that multiple gateways can be deployed cross-region and configuration can be managed from a single place.
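To make these principles concrete, here is a minimal, illustrative sketch of what a request handler built on Hono could look like. The route, in-memory stores, limits, and upstream call are assumptions for illustration only, not TrueFoundry's actual implementation.

```typescript
// Hypothetical sketch only; names and limits are illustrative, not TrueFoundry's code.
import { Hono } from "hono";

const app = new Hono();

// In-memory state, kept up to date by configuration pushed from the control-plane.
const validKeys = new Set<string>();              // allowed API keys / JWT subjects
const requestCounts = new Map<string, number>();  // per-key request counters

app.post("/v1/chat/completions", async (c) => {
  const token = c.req.header("Authorization")?.replace("Bearer ", "") ?? "";

  // Principle 2: auth and rate-limit checks are pure in-memory lookups.
  if (!validKeys.has(token)) return c.json({ error: "unauthorized" }, 401);
  const used = requestCounts.get(token) ?? 0;
  if (used >= 1000) return c.json({ error: "rate limit exceeded" }, 429);
  requestCounts.set(token, used + 1);

  // Principle 1: the only network call in the request path is to the LLM provider itself.
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify(await c.req.json()),
  });
  const body = await upstream.json();

  // Principles 3 and 4: logging is fire-and-forget; a queue outage never fails the request.
  queueMicrotask(() => {
    /* publish a usage event to the logging queue here */
  });

  return c.json(body, upstream.ok ? 200 : 502);
});

export default app;
```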
Diagram of TrueFoundry AI Gateway architecture showing control-plane, gateway pods, and data flow

TrueFoundry AI Gateway Architecture

The key components of the gateway are:
  1. Global Authentication/Licensing Server: This is a global licensing/authentication server used to authenticate every user who logs into the control-plane. In case you are installing the control-plane on your end, the Global Authentication server remains within TrueFoundry.
The only information that flows to the global licensing server is the emails of the employees logging into the control-plane and the number of requests flowing through the gateway.
  2. UI: This is the frontend for the gateway. It comprises an LLM playground, dashboards, and configuration panels for the different components.
  3. Postgres Database: This database is used to store the configuration for the models, users, teams, virtual accounts, rate-limiting config, load-balancing config, etc.
  4. Blob Storage: This stores the logs of the LLM gateway, which are used for analytics, metrics, and debugging. This can be S3, GCS, Azure Blob Storage, or any other S3-compatible storage.
  5. NATS: This is the queue that acts as the bridge between the control-plane and the gateway pods. The control-plane publishes all updates to this queue, which are instantly propagated to all the gateway pods. This data includes the users, the authorization data between users and models, the load-balancing configs, and the aggregated data for each segment and model that the gateway uses to perform rate-limiting with in-memory checks.
  6. Backend-service: This is the microservice that interacts and coordinates with the Postgres database, ClickHouse, NATS, and the UI.
  7. Gateway: This is the actual gateway pod, which can be deployed with the control-plane or in a separate cluster. The gateway pods only subscribe to the NATS queue and don't have any other external dependencies. All checks for authorization and rate-limiting are done in memory, and no network calls are made when an LLM request arrives at the gateway.
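To illustrate how a gateway pod might keep its in-memory state in sync with the control-plane, here is a hedged sketch of a NATS subscription loop using the nats.js client. The subject name, payload shape, and variable names are assumptions, not the actual TrueFoundry schema.

```typescript
// Illustrative sketch: subject and payload shape are assumptions, not the real schema.
import { connect, StringCodec } from "nats";

const sc = StringCodec();

type GatewayConfig = {
  users: Record<string, string[]>;    // user -> model IDs the user may call
  rateLimits: Record<string, number>; // segment -> requests or tokens per minute
  loadBalancing: unknown;             // routing rules
};

// The in-memory copy used for all request-time checks.
let config: GatewayConfig | null = null;

async function syncConfig() {
  const nc = await connect({ servers: process.env.NATS_URL ?? "nats://localhost:4222" });
  // The control-plane publishes configuration updates; the gateway only subscribes.
  const sub = nc.subscribe("gateway.config");
  for await (const msg of sub) {
    config = JSON.parse(sc.decode(msg.data)) as GatewayConfig;
  }
}

syncConfig();
```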

Request Flow

Here's how an LLM request flows through the gateway:

Diagram illustrating the request flow through the AI Gateway, from user to LLM provider and back

The key steps in the request flow are:
1. User makes a request to the AI Gateway

The user will make a request to the LLM gateway with a valid JWT token, in the OpenAI-compatible API format.
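For example, a client could call the gateway through the standard OpenAI SDK by pointing it at the gateway's base URL. The URL, path, and model name below are placeholders for your own deployment.

```typescript
// The base URL, path, and model name are placeholders; substitute your gateway's values.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://your-gateway.example.com/api/llm/v1", // your gateway endpoint
  apiKey: process.env.GATEWAY_TOKEN,                      // JWT issued for the user or virtual account
});

const completion = await client.chat.completions.create({
  model: "openai-main/gpt-4o", // model ID as registered in the gateway
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(completion.choices[0].message.content);
```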
2. Gateway validates the request

The gateway will validate the request to check if the JWT token is valid. It also validates whether the user has access to the model. The gateway subscribes to all the authorization rules and updates from the control-plane and performs in-memory checks to ensure that the request is authenticated and authorized.
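A minimal sketch of what such an in-memory authorization lookup could look like; the data structures are illustrative, not the gateway's actual ones.

```typescript
// Illustrative in-memory authorization map, kept in sync via config updates from the control-plane.
const userModelAccess = new Map<string, Set<string>>(); // user ID -> model IDs the user may call

function isAuthorized(userId: string, modelId: string): boolean {
  // Pure in-memory lookup: no network call is made while serving the request.
  return userModelAccess.get(userId)?.has(modelId) ?? false;
}
```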
3. Gateway checks for budget-limiting and rate-limiting rules

The gateway will check the rate-limiting and budget-limiting rules to ensure that the user is not exceeding any of the threshold limits.
4. Gateway selects the model based on the load-balancing rules

The gateway then checks the load-balancing rules to select the appropriate model for the request.
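As a sketch of the idea, a weighted pick between candidate models could look like the following; the helper, weights, and model names are illustrative and not the gateway's actual routing logic.

```typescript
// Hypothetical weighted selection between load-balancing targets.
type Target = { model: string; weight: number };

function pickTarget(targets: Target[]): Target {
  const total = targets.reduce((sum, t) => sum + t.weight, 0);
  let r = Math.random() * total;
  for (const t of targets) {
    r -= t.weight;
    if (r <= 0) return t;
  }
  return targets[targets.length - 1]; // guard against floating-point rounding
}

// e.g. send 90% of traffic to the primary deployment and 10% to another provider
const target = pickTarget([
  { model: "azure/gpt-4o", weight: 90 },
  { model: "openai/gpt-4o", weight: 10 },
]);
```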
5. Router/Adapter translates the request to the appropriate model

Since the gateway unifies the APIs of different LLM providers, it needs to translate the request into the format expected by the target LLM model. It then forwards the translated request to the model.
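To illustrate the kind of translation involved, here is a rough sketch that maps an OpenAI-style chat request to the shape Anthropic's Messages API expects; real provider adapters handle many more fields (streaming, tool calls, images, and so on).

```typescript
// Rough illustration of request translation; not the gateway's actual adapter code.
type OpenAIChatRequest = {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  max_tokens?: number;
};

// Translate an OpenAI-style chat request into the shape Anthropic's Messages API expects:
// the system prompt moves to a top-level field and max_tokens becomes mandatory.
function toAnthropic(req: OpenAIChatRequest) {
  const system = req.messages.find((m) => m.role === "system")?.content;
  return {
    model: req.model,
    system,
    max_tokens: req.max_tokens ?? 1024,
    messages: req.messages
      .filter((m) => m.role !== "system")
      .map((m) => ({ role: m.role, content: m.content })),
  };
}
```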
6. Handle response from the model

If the model responds successfully, the response is forwarded to the user, and the request data is sent to the queue for logging and metrics collection in an async manner. If the model responds with an error, the gateway will check if any fallback model is configured and retry the request against the fallback model.
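A simplified sketch of the fallback flow, assuming an ordered list of targets with the primary model first; the real gateway also applies its configured retry and status-code rules.

```typescript
// Simplified fallback loop; retry policies and status-code handling are omitted.
async function callWithFallback(
  request: unknown,
  targets: string[],                                         // primary model first, then fallbacks
  invoke: (model: string, req: unknown) => Promise<Response>,
): Promise<Response> {
  let lastResponse: Response | undefined;
  for (const model of targets) {
    const res = await invoke(model, request);
    if (res.ok) return res; // success: forward to the user, log asynchronously
    lastResponse = res;     // failure: try the next configured fallback model
  }
  return lastResponse!;     // every target failed; surface the last error
}
```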
7. Log the request and response

The request and response are stored in the queue, from where they are picked up by the backend service and inserted into the ClickHouse database (which is backed by blob storage). The aggregation service calculates the metrics required for rate-limiting and budget-limiting. These metrics are sent back to the queue, to which the gateway subscribes. Hence, all the gateway pods refresh the updated rate-limiting, latency, and budget metrics in an async way.
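The async logging path could be sketched as a fire-and-forget publish to the queue; the subject name and event fields below are assumptions.

```typescript
// Illustrative fire-and-forget usage logging; subject and fields are assumptions.
import { connect, JSONCodec } from "nats";

type UsageEvent = {
  user: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  cost: number;
};

const jc = JSONCodec<UsageEvent>();
const nc = await connect({ servers: process.env.NATS_URL ?? "nats://localhost:4222" });

function logRequest(event: UsageEvent) {
  try {
    // Publishing is non-blocking; a queue outage must never fail the user's request.
    nc.publish("gateway.request.logs", jc.encode(event));
  } catch {
    // Swallow errors: the response has already been served to the user.
  }
}
```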

Gateway Performance and Benchmarks

The gateway has been benchmarked to handle 250 RPS on a single pod with 1 CPU and 1 GB of RAM. The key results of the benchmark are:
  • Near-Zero Overhead: TrueFoundry AI Gateway adds only about 3 ms of extra latency up to 250 RPS and about 4 ms at more than 300 RPS.
  • High Scalability: The AI Gateway can scale without any degradation in performance until about 350 RPS on a machine with 1 vCPU and 1 GB of RAM, before CPU utilisation reaches 100% and latencies start to get affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
  • There is no visible overhead on the gateway even after applying multiple authorization rules, load-balancing and rate-limiting configs.

🚀 Learn more about AI Gateway benchmarks here: Read more

FAQ

What is NATS used for in the gateway architecture?

NATS is used as the message broker between the control-plane and the gateway pods. The control-plane publishes all updates to this queue, which are instantly propagated to all the gateway pods. This data includes the users, the authorization data between users and models, the load-balancing configs, and the aggregated data for each segment and model that the gateway uses to perform rate-limiting with in-memory checks. NATS also serves as a cache to store data temporarily, since it provides a key-value store with TTL functionality.

After every request that passes through the gateway, the gateway pod publishes an event to NATS containing the input tokens, output tokens, latency, cost, model name, etc. The control-plane has an aggregator service that subscribes to NATS for all these updates. It then calculates aggregates of latency per model and total cost per user, model, team, or any other custom segment defined by the user in the load-balancing / rate-limiting configs. The aggregated data is then published back to NATS by the aggregator service. The gateway pods subscribe to NATS for this aggregated data. The aggregated data is updated in the in-memory store of the gateway pods, using which all the checks for rate-limiting, load-balancing, and budget-limiting are performed.
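A compact, illustrative sketch of the aggregator loop described above; the segment key, subjects, and event shape are assumptions.

```typescript
// Hypothetical aggregator in the control-plane: consumes usage events, publishes aggregates.
import { connect, JSONCodec } from "nats";

type UsageEvent = { user: string; model: string; cost: number; latencyMs: number };
type Aggregate = { key: string; cost: number; count: number };

const events = JSONCodec<UsageEvent>();
const aggregates = JSONCodec<Aggregate>();
const totals = new Map<string, Aggregate>(); // keyed by segment, e.g. "user:alice"

const nc = await connect({ servers: process.env.NATS_URL ?? "nats://localhost:4222" });
const sub = nc.subscribe("gateway.request.logs");

for await (const msg of sub) {
  const e = events.decode(msg.data);
  const key = `user:${e.user}`;
  const agg = totals.get(key) ?? { key, cost: 0, count: 0 };
  agg.cost += e.cost;
  agg.count += 1;
  totals.set(key, agg);
  // Publish the refreshed aggregate so every gateway pod can update its in-memory counters.
  nc.publish("gateway.aggregates", aggregates.encode(agg));
}
```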
How does the gateway enforce rate-limiting?

TrueFoundry AI Gateway uses a sliding-window token bucket algorithm to enforce rate-limiting. Since the minimum unit of rate-limiting is a per-minute window for LLM traffic, the gateway maintains a sliding window of 60 seconds to track the number of requests. The gateway also maintains token buckets for each user, model, team, or any other custom segment defined by the user in the load-balancing / rate-limiting configs, with one bucket per 5-second interval. So to calculate the counter for the last 60 seconds, it sums up the tokens in the last 12 buckets. Older buckets are removed from the sliding window.
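The bucket arithmetic can be sketched as follows, mirroring the 5-second buckets and 60-second window described above; everything else (class name, limits, usage) is illustrative.

```typescript
// Illustrative sliding-window counter: 5-second buckets summed over the last 60 seconds.
class SlidingWindowCounter {
  private buckets = new Map<number, number>(); // bucket start (epoch seconds) -> token count

  add(tokens: number, now = Date.now() / 1000): void {
    const bucket = Math.floor(now / 5) * 5;
    this.buckets.set(bucket, (this.buckets.get(bucket) ?? 0) + tokens);
  }

  // Sum the last 12 five-second buckets (= the last 60 seconds) and drop older ones.
  totalLastMinute(now = Date.now() / 1000): number {
    const oldest = Math.floor(now / 5) * 5 - 55;
    let total = 0;
    for (const [start, count] of this.buckets) {
      if (start < oldest) this.buckets.delete(start);
      else total += count;
    }
    return total;
  }
}

// Usage: allow a request only while the segment stays under its per-minute token limit.
const counter = new SlidingWindowCounter();
counter.add(1200);
const withinLimit = counter.totalLastMinute() < 100_000;
```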
How does the gateway get its configuration?

The gateway uses the control-plane to fetch all the configuration data for models, users, virtual accounts, rate-limiting, routing configs, and SSO configuration when it starts up. It also subscribes to the NATS queue to stay updated about any configuration changes in realtime. If the NATS queue is down for some reason and the gateway cannot connect to it, it also tries to get the configuration from the control-plane backend service via an HTTP request. If this also fails, the gateway will not start up and the readiness probe of the gateway pods will fail, which will instruct Kubernetes to not direct any traffic to the gateway pod. Getting the configuration from the control-plane via the REST API call is implemented as an additional fail-safe mechanism in case the control-plane is only partially up.

If the NATS queue is up, the gateway will sync configs in realtime. If the NATS queue is down and the backend service is up, the sync will happen only at startup, and no further sync will happen until NATS comes back online. Once it comes back up, all configuration is automatically synced to the gateway and it is fully reconciled with the control-plane configuration.
What happens if the control-plane goes down?

The gateway continues to work with whatever configuration it has already fetched from the control-plane and doesn't go down even if the control-plane is down. Once the control-plane is back up, all the configuration that hasn't synced automatically syncs up with the gateway. If gateway pods restart while the control-plane is down and both NATS and the backend service are unreachable, the restarted gateway pod will not start up. We recommend keeping multiple pods of the gateway in a deployment so that the chances of all of them restarting while the control-plane is down are minimal. The gateway pods' readiness probe succeeds only when all configuration has been synced from the control-plane; before that, no traffic is directed by Kubernetes to the gateway pod.
What if a gateway pod misses a configuration update?

We use NATS with at-least-once delivery guarantees. The payload in NATS is idempotent, which means that the entire state of the configuration is published from the control-plane to NATS. So publishing the same payload multiple times has no impact on the state of the configuration and hence on gateway functionality.

However, in some failure scenarios, it's possible that a gateway misses one of the updates from the control-plane. To mitigate this case further, the control-plane publishes the entire configuration every 10 minutes. This makes sure that all gateways are synced every 10 minutes even if one intermediate update was missed. This ensures eventual consistency and makes sure that there is no configuration drift between the control-plane and the gateway.
How do authentication and authorization work without external calls?

The gateway syncs the SSO configuration from the control-plane via NATS. Once the gateway knows the issuer URL, it downloads the public keys used to verify JWT tokens and caches them in the gateway pod. Every request that hits the gateway is verified using the cached public key and hence doesn't require any external calls to verify the JWT token. For authorization, the entire map of users, models, and their associations is kept in memory in the gateway pod. The authorization check is done in memory and hence doesn't require any external calls. Since users and models are essentially small strings, this doesn't take any significant memory, and the authorization checks are very fast.
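For illustration, local JWT verification against cached public keys could look like the following, using the jose library; the issuer URL is a placeholder and the caching behaviour is simplified.

```typescript
// Illustrative local JWT verification; the issuer URL is a placeholder.
import { createRemoteJWKSet, jwtVerify } from "jose";

// The JWKS is fetched from the issuer discovered via the SSO config and cached in the pod,
// so per-request verification needs no external call.
const jwks = createRemoteJWKSet(new URL("https://auth.example.com/.well-known/jwks.json"));

async function verifyRequestToken(token: string) {
  const { payload } = await jwtVerify(token, jwks, {
    issuer: "https://auth.example.com",
  });
  return payload; // contains the subject used for the in-memory authorization lookup
}
```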