Rate Limiting

Rate Limit requests for specified providers or models

The rate limiting configuration lets you cap a given set of requests at a specified number of tokens or requests per minute, hour, or day. The configuration contains an array of rules. Every request is evaluated against the rules in order, and only the first matching rule is applied; subsequent rules are ignored.
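Because only the first matching rule applies, the order of the rules matters: a broad rule placed before a narrow one will shadow it. The following is a minimal Python sketch of that behaviour, not the gateway's actual implementation. The function name `first_matching_rule` and the rule IDs are hypothetical, and the sketch assumes that a rule without a `models` filter matches every model.

```python
def first_matching_rule(rules, model):
    """Return the first rule in list order that accepts `model`, or None."""
    for rule in rules:
        models = (rule.get("when") or {}).get("models")
        # Assumption: a rule with no `models` filter matches every model.
        if models is None or model in models:
            return rule
    return None


# Hypothetical rules, written as the dicts a YAML parser would produce.
specific = {
    "id": "gpt4-limit",
    "when": {"models": ["openai-main/gpt4"]},
    "limit_to": 1000,
    "unit": "requests_per_day",
}
catch_all = {"id": "catch-all", "limit_to": 1000000, "unit": "tokens_per_day"}

# Specific rule first: the gpt4 request is governed by the specific rule.
print(first_matching_rule([specific, catch_all], "openai-main/gpt4")["id"])

# Catch-all first: it shadows the specific rule, which is never reached.
print(first_matching_rule([catch_all, specific], "openai-main/gpt4")["id"])
```

Under these assumptions, it is safest to list the most specific rules first and any catch-all rule last.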

For each rule, we have three sections:

  1. when
    1. subjects: An array of users or teams from which the request originates, e.g. user:bob, team:team1
    2. models: An array of model IDs used to filter requests. The model IDs are the same values that are passed in the model field of the request.
    3. metadata: Key-value pairs that are matched against the metadata of the request, e.g. env: dev
  2. limit_to: Integer value that, together with unit, specifies the limit (e.g. 100000 tokens per minute)
  3. unit: Possible values are requests_per_minute, requests_per_hour, requests_per_day, tokens_per_minute, tokens_per_hour, tokens_per_day
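To make the three sections concrete, here is a rough Python sketch of how a rule's `when` block and `unit` could be interpreted. It illustrates the idea only and is not the gateway's code. The helper names `parse_unit` and `rule_matches` are hypothetical, and the matching semantics are assumptions: a request matches if any of its subjects is listed, its model is listed, and every metadata key in the rule has the same value on the request. An omitted filter is assumed to match everything.

```python
WINDOW_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def parse_unit(unit):
    """Split e.g. 'tokens_per_minute' into ('tokens', 60)."""
    counted, _, period = unit.partition("_per_")
    if counted not in ("requests", "tokens") or period not in WINDOW_SECONDS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return counted, WINDOW_SECONDS[period]


def rule_matches(rule, subjects, model, metadata):
    """Check a request against a rule's `when` block (assumed semantics)."""
    when = rule.get("when") or {}
    # Assumption: any overlap between request subjects and rule subjects matches.
    if "subjects" in when and not set(subjects) & set(when["subjects"]):
        return False
    if "models" in when and model not in when["models"]:
        return False
    # Assumption: every metadata pair in the rule must be present on the request.
    wanted = when.get("metadata") or {}
    return all(metadata.get(key) == value for key, value in wanted.items())


rule = {
    "id": "example-rule",  # hypothetical ID
    "when": {
        "subjects": ["user:bob"],
        "models": ["openai-main/gpt4"],
        "metadata": {"env": "dev"},
    },
    "limit_to": 1000,
    "unit": "requests_per_day",
}

print(rule_matches(rule, {"user:bob"}, "openai-main/gpt4", {"env": "dev"}))
print(parse_unit(rule["unit"]))
```

Keeping the unit as a single string such as `tokens_per_minute` means both what is counted and the length of the window can be derived from one field.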

An example of a rate limiting config is as follows:

name: ratelimiting-config
type: gateway-rate-limiting-config
# The rules are evaluated in order, and only the first matching rule is applied.
# If that rule causes a rate limit, its ID will be returned.
rules:
  # Limit all requests to gpt4 model from openai-main account for user:bob
  # with metadata env: dev to 1000 requests per day
  - id: "openai-gpt4-dev-env"
    when: 
      subjects: ["user:bob"]
      models: ["openai-main/gpt4"]
      metadata:
        env: dev
    limit_to: 1000
    unit: requests_per_day
  # Limit all requests to gpt4 model for team:backend to 20000 tokens per minute
  - id: "openai-gpt4-team-backend"
    when: 
      subjects: ["team:backend"]
      models: ["openai-main/gpt4"]
    limit_to: 20000
    unit: tokens_per_minute
  # Limit all requests to llama3 model from bedrock for customer1
  # to 10 requests per minute
  - id: "llama-bedrock-customer1-limit"
    when: 
      models: ["bedrock/llama3"]
      metadata:
        customer-id: customer1
    limit_to: 10
    unit: requests_per_minute
  # Limit every user to 1000000 tokens per day
  - id: "{user}-daily-limit"
    limit_to: 1000000
    unit: tokens_per_day