Benchmarking LLMs

Measure token generation throughput, Time to First Token (TTFT), and Inter Token Latency of LLMs via the chat completions API

Deploying the LLM Benchmarking Tool

The LLM Benchmarking Tool allows you to measure key performance metrics of language models, including token generation throughput, Time to First Token (TTFT), and Inter Token Latency.

Deployment via Application Catalog

The simplest way to deploy the LLM Benchmarking Tool is through the Application Catalog:

  1. Navigate to the Application Catalog and select "Benchmark LLM performance"
  2. Fill in the required deployment information, including Name and Host endpoint for your benchmarking tool.

Model Configuration

Basic Configuration Parameters

Load Test Parameters

  • Number of users (peak concurrency): Maximum number of concurrent users for the load test
  • Ramp up (users started/second): Rate at which new users are started until peak concurrency is reached (illustrated in the sketch after this list)
  • Host: Base URL of the LLM API server exposing OpenAI-compatible endpoints (v1/chat/completions)
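
To give a feel for how the first two parameters interact, here is a minimal sketch of the load model, assuming users are started at the ramp-up rate until peak concurrency is reached and each simulated user then loops over requests. The names and the placeholder request are illustrative only, not the tool's actual implementation.

    import asyncio

    PEAK_CONCURRENCY = 50    # Number of users (peak concurrency)
    RAMP_UP_PER_SECOND = 5   # Ramp up (users started/second)

    async def user_loop(user_id: int) -> None:
        # Placeholder for one simulated user; in the real tool each user
        # repeatedly calls the chat completions endpoint on the configured Host.
        while True:
            await asyncio.sleep(1.0)  # stand-in for a request/response cycle

    async def run_load_test(hold_seconds: float = 30.0) -> None:
        users: list[asyncio.Task] = []
        # Start RAMP_UP_PER_SECOND new users each second until the peak is reached.
        while len(users) < PEAK_CONCURRENCY:
            for _ in range(min(RAMP_UP_PER_SECOND, PEAK_CONCURRENCY - len(users))):
                users.append(asyncio.create_task(user_loop(len(users))))
            await asyncio.sleep(1.0)
        # Hold peak concurrency for the remainder of the test, then stop all users.
        await asyncio.sleep(hold_seconds)
        for task in users:
            task.cancel()
        await asyncio.gather(*users, return_exceptions=True)

    if __name__ == "__main__":
        asyncio.run(run_load_test())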

Model Settings

  • Tokenizer (HuggingFace tokenizer to use to count tokens): Identifier of the HuggingFace tokenizer used to count tokens in prompts and responses (see the counting sketch after this list)
  • Model (Name of the model in chat/completions payload): Model identifier used in the API request payload
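
As an example of what the Tokenizer setting is used for, the snippet below counts prompt and response tokens with a HuggingFace tokenizer via the transformers library. The tokenizer identifier and the example strings are illustrative; the tool's internal counting logic may differ slightly.

    from transformers import AutoTokenizer

    # The same identifier you enter in the Tokenizer field, for example the
    # Llama 3.1 tokenizer referenced later in this guide.
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B-Instruct")

    prompt = "Explain Time to First Token in one sentence."
    response = "Time to First Token is the delay between sending a request and receiving the first generated token."

    prompt_tokens = len(tokenizer.encode(prompt))
    response_tokens = len(tokenizer.encode(response, add_special_tokens=False))
    print(f"prompt tokens: {prompt_tokens}, response tokens: {response_tokens}")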

Finding Parameters for TrueFoundry Deployed Models

When using a model deployed on TrueFoundry, you can find the required parameters in the model's deployment spec, as shown in the steps below and the sketch that follows them:

  1. Model Name: Look under the env section for MODEL_NAME

    env:
      MODEL_NAME: nousresearch-meta-llama-3-1-8b-instruct
  2. Host: Find the host URL under the ports section

    ports:
      - host: example-model.truefoundry.tech
  3. Tokenizer: Look for the model_id under artifacts_download.artifacts

    artifacts_download:
      artifacts:
        - type: huggingface-hub
          model_id: NousResearch/Meta-Llama-3.1-8B-Instruct
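
Putting those three values together, the sketch below shows how they map onto the benchmark configuration and runs a quick smoke test against the OpenAI-compatible chat completions endpoint. It assumes the service is reachable over HTTPS at the host from the ports section and needs no extra authentication headers; adjust as required.

    import requests

    # Values copied from the deployment spec shown above.
    benchmark_config = {
        "host": "https://example-model.truefoundry.tech",        # ports -> host
        "model": "nousresearch-meta-llama-3-1-8b-instruct",      # env -> MODEL_NAME
        "tokenizer": "NousResearch/Meta-Llama-3.1-8B-Instruct",  # artifacts_download -> model_id
    }

    # Smoke test: the Host should answer on the OpenAI-compatible
    # v1/chat/completions route used by the benchmarking tool.
    resp = requests.post(
        f"{benchmark_config['host']}/v1/chat/completions",
        json={
            "model": benchmark_config["model"],
            "messages": [{"role": "user", "content": "Say hello in one word."}],
            "max_tokens": 8,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])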

Finding Parameters for External Models

When using external models (like GPT-4), you'll need to configure the following parameters:

  1. Model Name: Use the model identifier from your provider (e.g., gpt-4, gpt-4o)
  2. Tokenizer: Find an equivalent tokenizer on HuggingFace (e.g., Quivr/gpt-4o)
  3. Host: Your external provider's API endpoint
  4. API Key: Your provider's API key (e.g., OpenAI API key)

Finding Parameters for TrueFoundry LLM Gateway Models

When using models through TrueFoundry's LLM Gateway (a verification sketch follows these steps):

  1. Navigate to the LLM Gateway in your TrueFoundry workspace
  2. Select the model you want to benchmark
  3. Click on the </> Code button to view the API integration code
  4. From the code example, you can find:
    • Host: The base URL in the request (e.g., https://truefoundry.tech/api/llm/api/inference/openai/chat/completions)
    • Model Name: The model identifier in the request payload (e.g., "model": "openai-main/gpt-4o")
    • OpenAI API Key: Generate one using the "Generate API Key" button
    • Tokenizer: Find an equivalent tokenizer on HuggingFace (e.g., Quivr/gpt-4o)
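
As a quick check that these values are correct before starting a benchmark, the sketch below calls the Gateway with the official openai Python client. It assumes the base URL is the chat completions URL from the code example with the trailing /chat/completions removed, and that the generated API key is passed as the bearer token; the placeholder key is illustrative.

    from openai import OpenAI

    client = OpenAI(
        # Chat completions URL from the Gateway code example, minus /chat/completions.
        base_url="https://truefoundry.tech/api/llm/api/inference/openai",
        api_key="<key-from-the-Generate-API-Key-button>",
    )

    response = client.chat.completions.create(
        model="openai-main/gpt-4o",
        messages=[{"role": "user", "content": "Reply with a single word."}],
        max_tokens=8,
    )
    print(response.choices[0].message.content)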

Prompt Configuration

  • Max Output Tokens: Maximum number of tokens allowed in the model's response
  • Use Random Prompts: Whether to use randomly generated prompts for testing
  • Use Single Prompt: Whether to use a single prompt for all test requests
  • Ignore EOS: Whether to ignore end-of-sequence tokens during token counting
  • Prompt Min Tokens: Minimum number of tokens in the input prompt (not used with random or single prompts)
  • Prompt Max Tokens: Maximum number of tokens in the input prompt (not used with random or single prompts); see the sketch after this list
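
To see how the prompt length limits behave, the sketch below trims a block of sample text so that its token count, measured with the configured tokenizer, lands near a target drawn between the minimum and maximum. This is only an illustration of the idea; the tool builds its prompts internally.

    import random
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B-Instruct")

    PROMPT_MIN_TOKENS = 128   # Prompt Min Tokens
    PROMPT_MAX_TOKENS = 256   # Prompt Max Tokens

    def build_prompt(corpus: str) -> str:
        # Pick a target length between the limits and trim the corpus to roughly
        # that many tokens (re-encoding may shift the count by a token or two).
        target = random.randint(PROMPT_MIN_TOKENS, PROMPT_MAX_TOKENS)
        token_ids = tokenizer.encode(corpus, add_special_tokens=False)[:target]
        return tokenizer.decode(token_ids)

    corpus = "The quick brown fox jumps over the lazy dog. " * 200
    prompt = build_prompt(corpus)
    print(len(tokenizer.encode(prompt, add_special_tokens=False)), "tokens")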

Viewing Benchmark Results

After running the benchmark, you'll see comprehensive performance metrics displayed in charts (the sketch after this list shows how the latency metrics can be measured):

  • Requests per Second
  • Active Users
  • Tokens per Second
  • Response Time Seconds
  • Response Time First Token (ms)
  • Inter Token Latency (ms)
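
For reference, these latency metrics can also be measured by hand against any OpenAI-compatible endpoint. The sketch below streams a single chat completion and derives Time to First Token, mean Inter Token Latency, and an approximate tokens-per-second figure from chunk arrival times; the host, model, and API key values are placeholders, and the benchmarking tool aggregates the same kind of measurements across all simulated users.

    import time
    from openai import OpenAI

    # Placeholder host and model; use the values configured for your benchmark run.
    client = OpenAI(
        base_url="https://example-model.truefoundry.tech/v1",
        api_key="not-needed-for-self-hosted-endpoints",
    )

    start = time.perf_counter()
    chunk_times = []

    stream = client.chat.completions.create(
        model="nousresearch-meta-llama-3-1-8b-instruct",
        messages=[{"role": "user", "content": "Write two sentences about benchmarking."}],
        max_tokens=128,
        stream=True,
    )
    for chunk in stream:
        # Each streamed chunk usually carries one generated token.
        if chunk.choices and chunk.choices[0].delta.content:
            chunk_times.append(time.perf_counter())

    ttft = chunk_times[0] - start                                            # Time to First Token
    itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)  # mean Inter Token Latency
    tps = len(chunk_times) / (chunk_times[-1] - start)                       # approx. tokens per second

    print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms, throughput: {tps:.1f} tok/s")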