Rate limiting is an important feature needed in many scenarios when managing LLM workloads. A few example use cases:

  1. Control cost per developer/team/application: It's easy to blow up costs with LLMs because of a bug somewhere in the code or an agent stuck in an infinite loop. A good safety measure is therefore to limit the cost per developer so that such mistakes don't become expensive.

  2. Rate limit self-hosted LLMs: Companies often deploy models on their own GPUs (on-prem or cloud). When a sudden surge in traffic leaves too few on-prem GPUs to serve requests, we want the excess to burst to per-token cloud API calls. To avoid overwhelming the on-prem GPUs in this situation, it's good to set up a rate limit on the on-prem LLM.

  3. Rate limit your customers based on their tier: Many products have different tiers of customers, each with a different limit on LLM usage. This can be modeled directly with a rate-limit configuration that sets a limit per customer.

Configure RateLimiting in Truefoundry AI Gateway

Using the rate limiting feature, you can limit certain sets of requests to a specified number of tokens or requests per minute/hour/day. The rate limiting configuration is defined as a YAML file with the following fields:

  1. name: The name of the rate limiting configuration - it can be anything and is only used for reference in logs.
  2. type: This should be gateway-rate-limiting-config. It helps Truefoundry identify that this is a rate limiting configuration file.
  3. rules: An array of rules.
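Putting these fields together, a minimal skeleton of the file (with an empty rules list, purely illustrative) looks like:

```yaml
name: my-ratelimit-config           # free-form, only used for reference in logs
type: gateway-rate-limiting-config  # fixed value that identifies the config type
rules: []                           # rules go here, evaluated top to bottom
```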

The rate limiting configuration contains an array of rules. Every request is evaluated against the rules in order, and only the first matching rule is applied; subsequent rules are ignored. So keep specialised rules at the top and generic catch-all rules at the bottom.
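The first-match behaviour can be illustrated with a small Python sketch. The rule shape here mirrors the fields described below; this is an illustration of the matching semantics, not the gateway's actual implementation:

```python
# Illustrative sketch of first-match rule selection (not the gateway's real code).
# A condition omitted from a rule's `when` block matches everything.

def matches(when, request):
    """A rule matches only if every condition present in its `when` block holds (AND)."""
    if "subjects" in when and request["subject"] not in when["subjects"]:
        return False
    if "models" in when and request["model"] not in when["models"]:
        return False
    if "metadata" in when:
        # Every key/value pair in the rule must appear in the request's metadata.
        if any(request.get("metadata", {}).get(k) != v for k, v in when["metadata"].items()):
            return False
    return True

def select_rule(rules, request):
    """Return the first matching rule; later rules are never consulted."""
    for rule in rules:
        if matches(rule["when"], request):
            return rule
    return None

rules = [
    {"id": "backend-gpt4", "when": {"subjects": ["team:backend"], "models": ["gpt4"]}},
    {"id": "catch-all", "when": {}},  # empty `when` matches every request
]

assert select_rule(rules, {"subject": "team:backend", "model": "gpt4"})["id"] == "backend-gpt4"
assert select_rule(rules, {"subject": "user:alice", "model": "gpt3"})["id"] == "catch-all"
```

Note how the specific `team:backend` rule only wins because it appears before the catch-all; swapping the order would make the catch-all absorb every request.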

For each rule, we have four sections:

  1. id: A unique identifier for the rule. Only used for reference in logs and metrics.
    • You can use dynamic placeholders like {user} and {model}, which are replaced by the actual user or model in the request.
      1. If you set the ID as {user}-daily-limit, the system will create a separate rule for each user (for example, alice-daily-limit, bob-daily-limit) and apply the limit individually to each one.
      2. If you set the ID as just daily-limit (without placeholders), the rule will apply collectively to the total number of requests from all users included in the when block.
  2. when (Define the subset of requests on which the rule applies): Truefoundry AI gateway provides a flexible way to define the exact subset of requests a rule applies to. We can filter based on the user calling the model, the model name, or any custom metadata key present in the request header X-TFY-METADATA. The subjects, models and metadata fields are combined in an AND fashion: the rule only matches if all the conditions are met. If an incoming request doesn't match one rule's when block, the next rule is evaluated.
    • subjects: Filter based on a list of users / teams / virtual accounts calling the model, specified as user:john-doe, team:engineering-team or virtual-account:acct_1234567890.
    • models: Rule matches if the model name in the request matches any of the models in the list.
    • metadata: Rule matches if the metadata in the rule is present in the request. For example, if we specify metadata: {environment: "production"}, the rule only matches requests whose X-TFY-METADATA header contains the key environment with value production.
  3. limit_to: Integer value that, together with unit, specifies the limit (e.g. 100000 tokens per minute).
  4. unit: Possible values are requests_per_minute, requests_per_hour, requests_per_day, tokens_per_minute, tokens_per_hour, tokens_per_day

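The dynamic-ID behaviour described above (a placeholder ID yielding one counter per expansion, a static ID yielding one shared counter) can be sketched as follows. This is an illustration of the assumed counter-per-expansion semantics, not gateway code:

```python
# Illustrative sketch: a rule id with placeholders produces a separate usage
# counter for each expansion, while a static id produces a single shared counter.
from collections import Counter

counters = Counter()

def record(rule_id, request):
    # Substitute placeholders with values from the request, then count against that key.
    key = rule_id.replace("{user}", request["user"]).replace("{model}", request["model"])
    counters[key] += 1
    return key

record("{user}-daily-limit", {"user": "alice", "model": "gpt4"})
record("{user}-daily-limit", {"user": "bob", "model": "gpt4"})
record("daily-limit", {"user": "alice", "model": "gpt4"})
record("daily-limit", {"user": "bob", "model": "gpt4"})

assert counters["alice-daily-limit"] == 1  # per-user buckets
assert counters["bob-daily-limit"] == 1
assert counters["daily-limit"] == 2        # one shared bucket for all users
```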
Let’s say you want to rate limit requests based on the following rules:

  1. Limit all requests to gpt4 model from openai-main account for user:bob@email.com to 1000 requests per day
  2. Limit all requests to gpt4 model for team:backend to 20000 tokens per minute
  3. Limit all requests to gpt4 model for virtualaccount:virtualaccount1 to 20000 tokens per minute
  4. Limit all models to have a limit of 1000000 tokens per day
  5. Limit all users to have a limit of 1000000 tokens per day
  6. Limit all users to have a limit of 1000000 tokens per day for each model

Your rate limit config would look like this:

name: ratelimiting-config
type: gateway-rate-limiting-config
# The rules are evaluated in order, and only the first matching rule is applied, subsequent rules are ignored.
rules:
  # Limit all requests to gpt4 model from openai-main account for user:bob@email.com to
  # 1000 requests per day
  - id: "bob-gpt4-daily-limit"
    when: 
      subjects: ["user:bob@email.com"]
      models: ["openai-main/gpt4"]
    limit_to: 1000
    unit: requests_per_day
  # Limit all requests to gpt4 model for team:backend to 20000 tokens per minute
  - id: "backend-gpt4-token-limit"
    when: 
      subjects: ["team:backend"]
      models: ["openai-main/gpt4"]
    limit_to: 20000
    unit: tokens_per_minute
  # Limit all requests to gpt4 model for virtualaccount:virtualaccount1 to 20000 tokens per minute
  - id: "virtualaccount1-gpt4-token-limit"
    when: 
      subjects: ["virtualaccount:virtualaccount1"]
      models: ["openai-main/gpt4"]
    limit_to: 20000
    unit: tokens_per_minute
  # Limit all models to have a limit of 1000000 tokens per day 
  - id: "{model}-daily-limit"
    when: {}
    limit_to: 1000000
    unit: tokens_per_day
  # Limit all users to have a limit of 1000000 tokens per day 
  - id: "{user}-daily-limit"
    when: {}
    limit_to: 1000000
    unit: tokens_per_day
  # Limit all users to have a limit of 1000000 tokens per day for each model 
  - id: "{user}-{model}-daily-limit"
    when: {}
    limit_to: 1000000
    unit: tokens_per_day
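Once a rule matches, the gateway enforces limit_to per unit window. A minimal fixed-window sketch of that allow/deny decision is shown below; the windowing scheme is an assumption for illustration, and the gateway's internal accounting may differ:

```python
# Illustrative fixed-window limiter for one rule, e.g. limit_to: 3 with
# unit: requests_per_minute. Not the gateway's actual implementation.
import time

class FixedWindowLimiter:
    def __init__(self, limit_to, window_seconds):
        self.limit_to = limit_to
        self.window_seconds = window_seconds
        self.counts = {}  # window index -> number of requests seen in that window

    def allow(self, now=None):
        """Count the request against the current window; allow if under the limit."""
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        self.counts[window] = self.counts.get(window, 0) + 1
        return self.counts[window] <= self.limit_to

limiter = FixedWindowLimiter(limit_to=3, window_seconds=60)
assert all(limiter.allow(now=0) for _ in range(3))  # first 3 requests in the minute pass
assert limiter.allow(now=0) is False                # 4th request in the same minute is rejected
assert limiter.allow(now=60) is True                # new window, counter resets
```

For token-based units the same decision applies, except each request increments the counter by its token usage instead of by one.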

Configure Rate Limiting on the Gateway

It's straightforward: simply go to the Config tab in the Gateway, add your configuration, and save.