The AI Gateway acts as a proxy between your application and the LLM provider. It sits in the critical path of production traffic and therefore needs to be highly available and performant. The Truefoundry AI Gateway has been designed as a low-latency, high-throughput AI gateway.

The gateway is written using the Hono framework, which is ultra-fast, minimalistic, and designed for the edge. The architecture of the gateway is based on the following principles:
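
For context, a request handler in Hono that forwards a chat-completion call upstream can be as small as the following sketch. The route and provider URL here are illustrative assumptions, not the gateway's actual code:

```ts
import { Hono } from "hono";

const app = new Hono();

// Illustrative proxy route: forward the client's request to the LLM provider.
app.post("/v1/chat/completions", async (c) => {
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Pass through the caller's credentials (placeholder handling).
      authorization: c.req.header("authorization") ?? "",
    },
    body: await c.req.text(),
  });
  // Stream the provider's response straight back to the client.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "content-type": upstream.headers.get("content-type") ?? "application/json",
    },
  });
});

export default app;
```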

  1. There are no external calls in the path of a request from the client to the LLM provider (unless caching is enabled).
  2. All checks for rate limits, load balancing, authentication, and authorization are done in memory (see the sketch after this list).
  3. Logs and metrics are written to a queue asynchronously.
  4. The gateway never fails a request, even if the external queue is down.
  5. The gateway is designed to be horizontally scalable and CPU-bound so that it scales smoothly as the number of requests increases.
  6. It separates the control-plane from the proxy so that multiple gateways can be deployed across regions while configuration is managed from a single place.
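
To make principles 2–4 concrete, here is a minimal sketch of an in-memory rate-limit check and fire-and-forget log publishing. The names (`allowRequest`, `publishLog`, the queue interface) are illustrative assumptions, not TrueFoundry's actual API:

```ts
// Fixed-window counters kept in process memory: no network call per request.
const windows = new Map<string, { count: number; resetAt: number }>();

function allowRequest(key: string, limit: number, windowMs: number): boolean {
  const now = Date.now();
  const w = windows.get(key);
  if (!w || now >= w.resetAt) {
    windows.set(key, { count: 1, resetAt: now + windowMs });
    return true;
  }
  if (w.count >= limit) return false;
  w.count++;
  return true;
}

// Logs are published asynchronously; a queue failure never fails the request.
async function publishLog(
  queue: { publish(msg: string): Promise<void> },
  entry: object,
): Promise<void> {
  try {
    await queue.publish(JSON.stringify(entry));
  } catch {
    // Swallow the error (optionally buffer and retry later);
    // the client's response is unaffected either way.
  }
}
```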

Truefoundry AI Gateway Architecture

The key components of the gateway are:

  1. UI: This is the frontend for the gateway. It comprises an LLM playground, a dashboard, and configuration panels for the different components.
  2. Postgres Database: This database stores the configuration for models, users, teams, virtual accounts, rate-limiting configs, load-balancing configs, etc.
  3. Clickhouse Database: This stores the logs of the LLM gateway, which are used for analytics, metrics, and debugging.
  4. NATS: This is the queue that bridges the control-plane and the gateway pods. The control-plane publishes all updates to this queue, and they are instantly propagated to all gateway pods. This data includes users, authorization data between users and models, load-balancing configs, and aggregated data for each segment and model so that the gateway can perform rate limiting with in-memory checks.
  5. Backend-service: This is the microservice that interacts and coordinates with the Postgres database, Clickhouse, NATS, and the UI.
  6. Gateway: This is the actual gateway pod, which can be deployed alongside the control-plane or in a separate cluster. The gateway pods only subscribe to the NATS queue and have no other external dependencies. All checks for authorization and rate limiting are done in memory, and no network calls are made when an LLM request arrives at the gateway (see the subscription sketch after this list).
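
The following sketch shows how a gateway pod could subscribe to NATS and keep its configuration snapshot in memory. The subject name, payload shape, and helper names are assumptions for illustration, not TrueFoundry's actual schema:

```ts
import { connect, StringCodec } from "nats";

const sc = StringCodec();

// In-memory snapshot consulted on every request; no network call on the hot path.
let config: {
  rateLimits: Record<string, number>;
  authorizedModels: Record<string, string[]>;
} = { rateLimits: {}, authorizedModels: {} };

async function watchControlPlane(): Promise<void> {
  // Server address and subject are placeholders for this example.
  const nc = await connect({ servers: "nats://control-plane:4222" });
  const sub = nc.subscribe("gateway.config.updates");
  for await (const msg of sub) {
    // Each control-plane update replaces the in-memory snapshot atomically.
    config = JSON.parse(sc.decode(msg.data));
  }
}

// Pure in-memory authorization check when an LLM request arrives.
function isAuthorized(user: string, model: string): boolean {
  return config.authorizedModels[user]?.includes(model) ?? false;
}

watchControlPlane().catch(console.error);
```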

Gateway Performance and Benchmarks

The gateway has been benchmarked to handle 250 RPS on a single pod with 1 CPU and 1 GB of RAM.

Benchmarking Results:

  • Near-Zero Overhead: TrueFoundry AI Gateway adds only 3 ms of extra latency up to 250 RPS, and 4 ms at RPS > 300.
  • High Scalability: The AI Gateway scales without any degradation in performance up to about 350 RPS on a 1 vCPU, 1 GB machine, at which point CPU utilisation reaches 100% and latencies begin to be affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
  • There is no visible overhead on the gateway even after applying multiple authorization rules, load-balancing configs, and rate-limiting configs.
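
If you want to reproduce a rough version of this overhead measurement yourself, a minimal latency probe might look like the following. The URLs, model name, and environment variable are placeholders, and this is not the harness used for the official benchmarks:

```ts
// Compare direct-to-provider vs via-gateway round-trip times (illustrative).
async function timeRequest(url: string, body: object): Promise<number> {
  const start = performance.now();
  await fetch(url, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${process.env.API_KEY ?? ""}`, // placeholder credential
    },
    body: JSON.stringify(body),
  });
  return performance.now() - start;
}

async function measureOverhead(samples = 50): Promise<void> {
  const body = { model: "gpt-4o-mini", messages: [{ role: "user", content: "ping" }] };
  let direct = 0;
  let viaGateway = 0;
  for (let i = 0; i < samples; i++) {
    direct += await timeRequest("https://api.openai.com/v1/chat/completions", body);
    viaGateway += await timeRequest("https://my-gateway.example.com/v1/chat/completions", body);
  }
  // Average per-request latency added by the gateway hop.
  console.log(`gateway overhead ≈ ${((viaGateway - direct) / samples).toFixed(1)} ms`);
}

measureOverhead().catch(console.error);
```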

🚀 Learn more about AI Gateway benchmarks here: Read more