Load Testing the Gateway
Highlights
- The TrueFoundry LLM Gateway scales seamlessly to 350 RPS on a single replica with 1 CPU unit while using only 270 MB of memory. We compared it with another gateway product on a similar setup, and the latter failed to scale beyond 50 RPS.
- The TrueFoundry LLM Gateway adds only 3-5 ms of extra latency per request, while the other gateway product adds 15-30 ms.
Load Test Setup
For our load testing experiment, we deployed a fake OpenAI endpoint service using TrueFoundry. The service simulates the OpenAI request and response format without actually producing tokens.
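To make this concrete, here is a minimal sketch of such a fake endpoint using FastAPI (the framework choice and all names here are our illustration, not necessarily how the deployed service is implemented). It accepts requests in the OpenAI chat-completions format and returns a fixed response immediately, so no tokens are ever generated:

```python
# fake_openai.py -- illustrative sketch of a fake OpenAI endpoint.
# It mimics the OpenAI chat-completions request/response shape but does
# no token generation, so latency reflects pure request handling.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> dict:
    body = await request.json()
    # Return a canned completion in the OpenAI response format.
    return {
        "id": "chatcmpl-fake",
        "object": "chat.completion",
        "model": body.get("model", "gpt-3.5-turbo"),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "This is a fake response."},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```

Served with, for example, `uvicorn fake_openai:app`, this gives a backend whose latency is dominated by request handling rather than model inference, which is exactly what we want when measuring gateway overhead.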
We also deployed the TrueFoundry LLM Gateway and another gateway product, each running on a single replica with 1 CPU unit and 1 GB of memory.
We added our fake OpenAI provider to both TrueFoundry and the other gateway product. While load testing, we made requests to the fake OpenAI server in 3 different ways (see the sketch after the list of setups below):
Setup 1: Directly without using any proxy or gateway
Setup 2: Through the TrueFoundry LLM Gateway deployed on 1 unit CPU and 1 GB memory
Setup 3: Through the other gateway product deployed on 1 unit CPU and 1 GB memory
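The following sketch shows one way the three setups can be exercised at a fixed request rate while measuring median latency. The base URLs are placeholders, and a real load test would typically use a dedicated tool and pass whatever auth headers each gateway requires; this is only to make the methodology concrete:

```python
# load_test.py -- illustrative load driver for the three setups.
# BASE_URLS are placeholders; substitute the actual endpoints.
import asyncio
import statistics
import time

import httpx

BASE_URLS = {
    "setup_1_direct": "http://fake-openai.example.com",       # placeholder
    "setup_2_truefoundry": "http://tfy-gateway.example.com",  # placeholder
    "setup_3_other": "http://other-gateway.example.com",      # placeholder
}

PAYLOAD = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
}

async def median_latency_at_rps(base_url: str, rps: int, duration_s: int = 30) -> float:
    """Fire `rps` requests per second for `duration_s` seconds and return
    the median request latency in milliseconds."""
    latencies: list[float] = []

    async def one_request(client: httpx.AsyncClient) -> None:
        start = time.perf_counter()
        await client.post(f"{base_url}/v1/chat/completions", json=PAYLOAD)
        latencies.append((time.perf_counter() - start) * 1000)

    async with httpx.AsyncClient(timeout=10.0) as client:
        tasks = []
        for _ in range(duration_s):
            # Launch one batch of `rps` requests, then wait a second.
            tasks += [asyncio.create_task(one_request(client)) for _ in range(rps)]
            await asyncio.sleep(1)
        await asyncio.gather(*tasks)

    return statistics.median(latencies)

if __name__ == "__main__":
    for name, url in BASE_URLS.items():
        median_ms = asyncio.run(median_latency_at_rps(url, rps=50))
        print(f"{name}: median latency {median_ms:.1f} ms at 50 RPS")
```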
Median Latency during Load Test
| Requests Per Second (RPS) | OpenAI direct (Setup 1) | TrueFoundry LLM Gateway (Setup 2) | Another Gateway Product (Setup 3) |
| --- | --- | --- | --- |
| 10 RPS | 73 ms | 76 ms (+3 ms) | 88 ms (+15 ms) |
| 50 RPS | 73 ms | 76 ms (+3 ms) | 99 ms (+26 ms) |
| 200 RPS | 73 ms | 76 ms (+3 ms) | Could not scale |
| 300 RPS | 73 ms | 77 ms (+4 ms) | Could not scale |
Observations
- TrueFoundry Gateway: adds only an extra 3 ms of latency up to 250 RPS, and 4 ms above 300 RPS. The TrueFoundry LLM Gateway scaled without any degradation in performance until about 350 RPS (on a 1 vCPU, 1 GB machine), at which point CPU utilization reached 100% and latencies started to be affected. With more CPU or more replicas, the LLM Gateway can scale to tens of thousands of requests per second.
More metrics
Setup 1: Direct OpenAI endpoint calling (metrics chart)
Setup 2: TrueFoundry LLM Gateway (metrics chart)