- There are no external calls in the path of serving a request from the client to the LLM provider (unless caching is enabled).
- All checks for rate limits, load balancing, authentication, and authorization are done in memory.
- Logs and metrics are written to a queue asynchronously.
- The gateway never fails a request, even if the external queue is down.
- The gateway is designed to be horizontally scalable and CPU-bound, so it scales nicely as the number of requests increases.
- It has a control-plane and proxy separation, so multiple gateways can be deployed across regions while configuration is managed from a single place.

TrueFoundry AI Gateway Architecture
- Global Authentication/Licensing Server: This global server authenticates every user who logs into the control plane. If you install the control plane on your end, the Global Authentication server remains within TrueFoundry. The only information that flows to it is the emails of the employees logging into the control plane and the number of requests flowing through the gateway.
- UI: This is the frontend for the gateway. It comprises an LLM playground, a dashboard, and configuration panels for the different components.
- Postgres Database: This database stores the configuration for models, users, teams, virtual accounts, rate-limiting configs, load-balancing configs, etc.
- Blob Storage: This stores the logs of the LLM gateway, which are used for analytics, metrics, and debugging. It can be S3, GCS, Azure Blob Storage, or any other S3-compatible storage.
- NATS: This is the queue that bridges the control plane and the gateway pods. The control plane publishes all updates to this queue, which are instantly propagated to all the gateway pods. This data includes the users, the authorization data between users and models, the load-balancing configs, and the aggregated data for each segment and model that the gateway needs to perform rate limiting with in-memory checks.
- Backend-service: This is the microservice that interacts and coordinates with the Postgres database, ClickHouse, NATS, and the UI.
- Gateway: This is the actual gateway pod, which can be deployed with the control plane or in a separate cluster. The gateway pods only subscribe to the NATS queue and don’t have any other external dependencies. All the checks for authorization and rate limiting are done in memory, and no network calls are made when an LLM request arrives at the gateway.
Request Flow
Here’s how an LLM request flows through the gateway:
1. User makes a request to the AI Gateway
The user makes a request to the LLM gateway with a valid JWT, in the OpenAI-compatible API format.
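For example, a request can be made with any OpenAI-compatible client; the base URL and model name below are placeholders for illustration, not actual TrueFoundry endpoints:

```python
from openai import OpenAI

# Placeholder values: substitute your actual gateway URL and the JWT
# issued for your user or virtual account.
client = OpenAI(
    base_url="https://your-gateway.example.com/api/llm",
    api_key="<your-truefoundry-jwt>",
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # hypothetical provider/model identifier
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```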
2. Gateway validates the request
The gateway validates the request to check that the JWT is valid. It also validates that the user has access to the model. The gateway subscribes to all the authorization rules and updates from the control plane and performs in-memory checks to ensure that the request is authenticated and authorized.
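A minimal sketch of what such an in-memory check might look like; the data structures and key handling here are assumptions for illustration, not the gateway’s actual implementation:

```python
import jwt  # PyJWT

# Kept up to date by the subscription to control-plane updates over NATS.
AUTH_RULES: dict[str, set[str]] = {"alice@acme.com": {"openai-main/gpt-4o"}}
PUBLIC_KEY = "..."  # control-plane signing key distributed to gateway pods

def authorize(token: str, model: str) -> bool:
    try:
        claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        return False  # authentication failed: invalid or expired JWT
    # Authorization is a pure in-memory lookup; no network call is made.
    return model in AUTH_RULES.get(claims["sub"], set())
```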
3. Gateway checks for budget-limiting and rate-limiting rules
The gateway checks the rate-limiting and budget-limiting rules to ensure that the user is not exceeding any of the threshold limits.
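Conceptually, this check is a lookup against limits and usage aggregates held in memory; the shapes below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Limits:
    requests_per_minute: int
    monthly_budget_usd: float

# Limits come from the control-plane config; usage aggregates are refreshed
# asynchronously over the queue, so both lookups stay in memory.
limits = {"user:alice": Limits(requests_per_minute=100, monthly_budget_usd=50.0)}
usage = {"user:alice": {"rpm": 42, "spend_usd": 17.50}}

def within_limits(segment: str) -> bool:
    lim, use = limits[segment], usage[segment]
    return use["rpm"] < lim.requests_per_minute and use["spend_usd"] < lim.monthly_budget_usd
```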
4. Gateway selects the model based on the load-balancing rules
The gateway then checks the load-balancing rules to select the appropriate model for the request.
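A weighted-random pick is one common way to implement such rules; the config shape below is an assumption for illustration, not the gateway’s actual schema:

```python
import random

# Hypothetical load-balancing rule: route 80% of traffic to the primary
# deployment and 20% to a secondary one.
targets = [
    {"model": "openai-main/gpt-4o", "weight": 80},
    {"model": "azure-backup/gpt-4o", "weight": 20},
]

def pick_target(rules: list[dict]) -> str:
    weights = [t["weight"] for t in rules]
    return random.choices(rules, weights=weights, k=1)[0]["model"]
```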
5. Router/Adapter translates the request to the appropriate model
Since the gateway unifies the APIs of different LLM providers, it needs to translate the request into the format expected by the target LLM model. It then forwards the translated request to the model.
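As an illustration, translating an OpenAI-style chat request into an Anthropic-style payload might look roughly like this (field mappings simplified for clarity):

```python
def to_anthropic(openai_req: dict) -> dict:
    # Anthropic's messages API takes the system prompt as a top-level field
    # and requires max_tokens, so both are mapped explicitly.
    system = " ".join(
        m["content"] for m in openai_req["messages"] if m["role"] == "system"
    )
    return {
        "model": openai_req["model"],
        "system": system or None,
        "messages": [m for m in openai_req["messages"] if m["role"] != "system"],
        "max_tokens": openai_req.get("max_tokens", 1024),
    }
```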
6. Handle the response from the model
If the model responds successfully, the response is forwarded to the user, and the request data is sent to the queue for logging and metrics collection in an async manner.
If the model responds with an error, the gateway checks whether a fallback model is configured and, if so, retries the request against the fallback model.
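The fallback behavior can be sketched as a simple retry loop; `call_model` and `ProviderError` below are hypothetical stand-ins for the gateway’s internals:

```python
class ProviderError(Exception):
    """Raised when an upstream provider returns an error (illustrative)."""

def call_model(model: str, request: dict) -> dict:
    ...  # hypothetical upstream call via the provider adapter

def call_with_fallback(request: dict, primary: str, fallbacks: list[str]) -> dict:
    for model in [primary, *fallbacks]:
        try:
            return call_model(model, request)
        except ProviderError:
            continue  # try the next configured fallback model
    raise RuntimeError("primary and all fallback models failed")
```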
7. Log the request and response
The request and response are stored in the queue, from where they are picked up by the backend service and inserted into the ClickHouse database (which is backed by Blob Storage). The aggregation service calculates the metrics required for rate-limiting and budget-limiting. These metrics are sent back to the queue, to which the gateway subscribes. Hence, all the gateway pods refresh the updated rate-limiting, latency, and budget metrics in an async way.
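A fire-and-forget publish keeps logging off the request’s critical path. The sketch below uses the nats-py client; the subject name and payload shape are assumptions for illustration:

```python
import asyncio
import json
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    event = {"model": "gpt-4o", "input_tokens": 12, "output_tokens": 80, "latency_ms": 230}
    # Publish is scheduled as a background task, so the response can be
    # returned to the user without waiting on the queue.
    asyncio.create_task(nc.publish("gateway.request-logs", json.dumps(event).encode()))
    await nc.drain()  # flush pending messages on shutdown

asyncio.run(main())
```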
Gateway Performance and Benchmarks
The gateway has been benchmarked to handle 250 RPS on a single pod with 1 CPU and 1 GB of RAM. The key results of the benchmark are:
- Near-Zero Overhead: TrueFoundry AI Gateway adds only ~3 ms of extra latency up to 250 RPS and ~4 ms at RPS > 300.
- High Scalability: The AI Gateway scales without any degradation in performance until about 350 RPS on a 1 vCPU & 1 GB machine, at which point CPU utilization reaches 100% and latencies start to be affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
- There is no visible overhead on the gateway even after applying multiple authorization rules, load-balancing, and rate-limiting configs.
FAQ
What role does NATS play in the architecture?
NATS is used as the message broker between the control plane and the gateway pods. The control plane publishes all updates to this queue, which are instantly propagated to all the gateway pods. This data includes the users, the authorization data between users and models, the load-balancing configs, and the aggregated data for each segment and model that the gateway needs to perform rate limiting with in-memory checks. NATS also serves as a cache to store data temporarily, since it provides a key-value store with TTL functionality. After every request that passes through the gateway, the gateway pod publishes an event to NATS containing the input tokens, output tokens, latency, cost, model name, etc.
The control plane has an aggregator service that subscribes to NATS for all these updates. It then calculates aggregates of latency per model, and of total cost per user, model, team, or any other custom segment defined by the user in the load-balancing/rate-limiting configs. The aggregated data is then published back to NATS by the aggregator service. The gateway pods subscribe to NATS for this aggregated data and update it in their in-memory store, using which all the checks for rate limiting, load balancing, and budget limiting are performed.
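The aggregator loop can be sketched roughly as follows with the nats-py client; the subject names and aggregation key are illustrative assumptions:

```python
import asyncio
import json
from collections import defaultdict
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    totals = defaultdict(lambda: {"cost_usd": 0.0, "requests": 0})

    async def on_usage(msg):
        event = json.loads(msg.data)
        agg = totals[event["segment"]]  # e.g. a user, team, or model
        agg["cost_usd"] += event["cost_usd"]
        agg["requests"] += 1
        # Publish the refreshed aggregate back for gateway pods to pick up.
        await nc.publish("gateway.aggregates", json.dumps({event["segment"]: agg}).encode())

    await nc.subscribe("gateway.request-logs", cb=on_usage)
    await asyncio.sleep(60)  # keep consuming for a while in this demo
    await nc.drain()

asyncio.run(main())
```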
How does the Gateway enforce rate-limiting?
TrueFoundry AI Gateway uses a sliding-window token-bucket algorithm to enforce rate limiting. Since the minimum unit of rate limiting for LLM traffic is a per-minute window, the gateway maintains a sliding window of 60 seconds to track the number of requests. The gateway also maintains a token bucket for each user, model, team, or any other custom segment defined by the user in the load-balancing/rate-limiting configs, with one bucket per 5 seconds. So to calculate the counter over the last 60 seconds, it sums up the tokens in the last 12 buckets. Older buckets are removed from the sliding window.
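A minimal sketch of such a bucketed sliding-window counter, assuming hypothetical class and method names:

```python
import time
from collections import deque

BUCKET_SECONDS = 5
WINDOW_BUCKETS = 12  # 12 buckets x 5 s = 60 s window

class SlidingWindowCounter:
    def __init__(self):
        self.buckets = deque()  # (bucket_start_epoch, count) pairs, oldest first

    def _evict(self, now: float):
        cutoff = now - BUCKET_SECONDS * WINDOW_BUCKETS
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.buckets.popleft()  # drop buckets older than the window

    def add(self, now: float | None = None, amount: int = 1):
        now = time.time() if now is None else now
        start = int(now // BUCKET_SECONDS) * BUCKET_SECONDS
        self._evict(now)
        if self.buckets and self.buckets[-1][0] == start:
            self.buckets[-1] = (start, self.buckets[-1][1] + amount)
        else:
            self.buckets.append((start, amount))

    def total(self, now: float | None = None) -> int:
        self._evict(time.time() if now is None else now)
        return sum(count for _, count in self.buckets)

# Usage: allow a request only if the last-minute count is under the limit.
counter = SlidingWindowCounter()
LIMIT_PER_MINUTE = 100
if counter.total() < LIMIT_PER_MINUTE:
    counter.add()
```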