Architecture
The AI Gateway acts as a proxy between your application and the LLM provider. Since it lies in the critical path of production traffic, it needs to be highly available and performant. The TrueFoundry AI Gateway has been designed to be a low-latency, high-throughput AI gateway.
The gateway is written using the Hono framework, which is ultra-fast, minimalistic, and designed for the edge. The architecture of the gateway is based on the following principles:
- There are no external calls in the path of a request from the client to the LLM provider (unless caching is enabled).
- All checks for rate limits, load balancing, authentication, and authorization are done in memory.
- Logs and metrics are written to a queue asynchronously.
- The gateway never fails a request even if the external queue is down.
- The gateway is designed to be horizontally scalable and CPU-bound so that it scales gracefully as the number of requests increases.
- It has a control-plane and proxy separation so that multiple gateways can be deployed across regions while configuration is managed from a single place.
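As a minimal illustration of the first two principles, the sketch below shows a Hono route that performs authorization as a pure in-memory lookup; the route path, header, and map shapes are illustrative, not the gateway's actual internals.

```typescript
import { Hono } from "hono";

// In-memory state, kept current by asynchronous control-plane updates, so
// the request hot path never blocks on an external call.
const allowedModels = new Map<string, Set<string>>(); // user -> permitted model IDs

const app = new Hono();

app.post("/v1/chat/completions", async (c) => {
  const user = c.req.header("x-user") ?? ""; // placeholder identity extraction
  const body = await c.req.json<{ model: string }>();

  // Authorization is a pure in-memory lookup; no network round trip.
  if (!allowedModels.get(user)?.has(body.model)) {
    return c.json({ error: "model not permitted" }, 403);
  }

  // ... forward to the provider here, then log asynchronously (fire-and-forget).
  return c.json({ ok: true });
});

export default app;
```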
TrueFoundry AI Gateway Architecture
The key components of the gateway are:
- UI: This is the frontend for the gateway. It comprises an LLM playground, a dashboard, and configuration panels for the different components.
- Postgres Database: This database is used to store the configuration for the models, users, teams, virtual accounts, rate-limiting config, load-balancing config etc.
- Clickhouse Database: This stores the logs of the LLM gateway, which are used for analytics, metrics, and debugging.
- NATS: This is the queue that bridges the control-plane and the gateway pods. The control-plane publishes all updates to this queue, and they are instantly propagated to all the gateway pods. This data includes the users, the authorization data between users and models, the load-balancing configs, and the aggregated data for each segment and model that the gateway needs to perform rate limiting with in-memory checks.
- Backend-service: This is the microservice that interacts and coordinates with the Postgres database, Clickhouse, NATS, and the UI.
- Gateway: This is the actual gateway pod, which can be deployed alongside the control-plane or in a separate cluster. The gateway pods only subscribe to the NATS queue and have no other external dependencies. All checks for authorization and rate limiting are done in memory, and no network calls are made when an LLM request arrives at the gateway (see the sketch below).
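The following sketch shows how a gateway pod could keep its in-memory view fresh by subscribing to NATS; the server address, subject name, and update shape are assumptions for illustration.

```typescript
import { connect, JSONCodec } from "nats";

// Local caches that the request path reads synchronously.
const loadBalancingConfig = new Map<string, unknown>();
const authRules = new Map<string, Set<string>>();

type ConfigUpdate = { kind: string; key: string; value: unknown };
const jc = JSONCodec<ConfigUpdate>();

async function subscribeToControlPlane() {
  // Server address and subject are illustrative.
  const nc = await connect({ servers: "nats://control-plane:4222" });
  const sub = nc.subscribe("gateway.config.updates");

  for await (const msg of sub) {
    const update = jc.decode(msg.data);
    // Apply each update to the in-memory view; requests see the new config
    // without the gateway ever calling out at request time.
    if (update.kind === "load-balancing") {
      loadBalancingConfig.set(update.key, update.value);
    } else if (update.kind === "authorization") {
      authRules.set(update.key, new Set(update.value as string[]));
    }
  }
}
```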
Request Flow
Here’s how an LLM request flows through the gateway:
The key steps in the request flow are:
User makes a request to the AI Gateway
The user makes a request to the LLM gateway with a valid JWT token, in the OpenAI-compatible API format.
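For example, a client can point the standard OpenAI SDK at the gateway; the base URL and model name below are placeholders.

```typescript
import OpenAI from "openai";

// Base URL and model name are placeholders; authenticate with the JWT
// issued by the gateway.
const client = new OpenAI({
  baseURL: "https://your-gateway.example.com/api/llm/v1",
  apiKey: process.env.GATEWAY_JWT, // gateway-issued JWT
});

const completion = await client.chat.completions.create({
  model: "openai-main/gpt-4o", // "provider/model" naming is an assumption
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(completion.choices[0].message.content);
```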
Gateway validates the request
The gateway first checks that the JWT token is valid. It also validates that the user has access to the requested model. The gateway subscribes to all authorization rules and updates from the control-plane and performs in-memory checks to ensure that the request is authenticated and authorized.
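A minimal sketch of such a check, assuming JWTs are verified with the jose library against a cached key set; the JWKS URL and rule map are illustrative.

```typescript
import { jwtVerify, createRemoteJWKSet } from "jose";

// JWKS URL is illustrative. The key set is fetched once and cached, so the
// per-request signature check is a local cryptographic operation.
const jwks = createRemoteJWKSet(
  new URL("https://control-plane.example.com/.well-known/jwks.json"),
);

// user -> models the user may call, kept current via control-plane updates.
const accessRules = new Map<string, Set<string>>();

async function authorize(token: string, model: string): Promise<string> {
  const { payload } = await jwtVerify(token, jwks); // verifies signature and expiry
  const subject = String(payload.sub ?? "");
  if (!accessRules.get(subject)?.has(model)) {
    throw new Error("user is not authorized for this model");
  }
  return subject;
}
```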
Gateway checks for budget-limiting and rate-limiting rules
The gateway checks the rate-limiting and budget-limiting rules to ensure that the user is not exceeding any of the configured limits.
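A simplified fixed-window sketch of an in-memory rate-limit check (the gateway's actual limits are segment-based and refreshed from aggregated metrics pushed over the queue):

```typescript
// Segment key -> current window. The window shape and key format here
// are simplifications for illustration.
interface RateWindow {
  count: number;
  resetAt: number;
}

const windows = new Map<string, RateWindow>();

function allowRequest(segment: string, limitPerMinute: number): boolean {
  const now = Date.now();
  const w = windows.get(segment);
  if (!w || now >= w.resetAt) {
    windows.set(segment, { count: 1, resetAt: now + 60_000 });
    return true;
  }
  if (w.count >= limitPerMinute) return false; // over the configured limit
  w.count++;
  return true;
}
```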
Gateway selects the model based on the load-balancing rules
Gateway then checks the load-balancing rules to select the appropriate model for the request.
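A sketch of weighted model selection, assuming a simple weight-per-target rule shape (the gateway's actual load-balancing config is richer):

```typescript
// Weighted random selection among candidate models.
interface Target {
  model: string;
  weight: number;
}

function pickModel(targets: Target[]): string {
  const total = targets.reduce((sum, t) => sum + t.weight, 0);
  let r = Math.random() * total;
  for (const t of targets) {
    r -= t.weight;
    if (r <= 0) return t.model;
  }
  return targets[targets.length - 1].model; // guard against float rounding
}

// e.g. pickModel([{ model: "gpt-4o", weight: 80 }, { model: "gpt-4o-mini", weight: 20 }]);
```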
Router/Adapter translates the request to the appropriate model
Since the Gateway unifies the API of different LLM providers, it needs to translate the request to the format expected by the target LLM model. It then forwards the translated request to the model.
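As an illustration, here is a minimal translation from the OpenAI chat format to the shape expected by Anthropic's Messages API; real adapters handle many more fields such as tools, streaming, and images.

```typescript
// Minimal OpenAI -> Anthropic request translation sketch.
interface OpenAIChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  max_tokens?: number;
}

function toAnthropic(req: OpenAIChatRequest) {
  const system = req.messages.find((m) => m.role === "system")?.content;
  return {
    model: req.model,
    max_tokens: req.max_tokens ?? 1024, // max_tokens is required by Anthropic
    ...(system ? { system } : {}), // system prompt moves to a top-level field
    messages: req.messages
      .filter((m) => m.role !== "system")
      .map((m) => ({ role: m.role, content: m.content })),
  };
}
```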
Handle Response from the model
If the model responds successfully, the response is forwarded to the user, and the request data is sent to the queue for logging and metrics collection asynchronously. If the model responds with an error, the gateway checks whether a fallback model is configured and, if so, retries the request against the fallback model.
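A sketch of the fallback behaviour; the forward() callback and function names are illustrative, not the gateway's internal API.

```typescript
// Retry against configured fallbacks on provider errors.
async function callWithFallback(
  forward: (model: string) => Promise<Response>,
  primary: string,
  fallbacks: string[],
): Promise<Response> {
  for (const model of [primary, ...fallbacks]) {
    const res = await forward(model);
    if (res.ok) return res; // success: hand the response back to the client
    // otherwise fall through and retry against the next configured model
  }
  return new Response("all configured models failed", { status: 502 });
}
```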
Log the request and response
The request and response are stored in the queue, from where they are picked up by the backend service and inserted into the Clickhouse database (which is backed by blob storage). The aggregation service calculates the metrics required for rate-limiting and budget-limiting and sends them back to the queue, to which the gateway subscribes. Hence, all the gateway pods refresh the updated rate-limiting, latency, and budget metrics asynchronously.
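A sketch of the fire-and-forget logging publish, assuming NATS as the queue; the subject name is illustrative. Note how a queue outage is swallowed so the request itself never fails, matching the principle stated above.

```typescript
import { JSONCodec, NatsConnection } from "nats";

const jc = JSONCodec<Record<string, unknown>>();

// Fire-and-forget logging off the response path.
function logRequestAsync(nc: NatsConnection, record: Record<string, unknown>): void {
  try {
    nc.publish("gateway.request.logs", jc.encode(record));
  } catch {
    // Queue unavailable: drop the log rather than failing the request.
  }
}
```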
Gateway Performance and Benchmarks
The gateway has been benchmarked to handle 250 RPS on a single pod with 1 CPU and 1 GB of RAM. The key results of the benchmark are:
- Near-Zero Overhead: TrueFoundry AI Gateway adds only ~3 ms of extra latency up to 250 RPS and ~4 ms at RPS > 300.
- High Scalability: The AI Gateway scales without any degradation in performance up to about 350 RPS on a 1 vCPU, 1 GB machine, at which point CPU utilisation reaches 100% and latencies start to be affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
- No Config Overhead: There is no visible overhead on the gateway even after applying multiple authorization rules, load-balancing configs, and rate-limiting configs.