Benchmarking LLMs
Measure token generation throughput, Time to First Token (TTFT), and Inter Token Latency of LLMs via the chat completions API
Deploying the LLM Benchmarking Tool
The LLM Benchmarking Tool allows you to measure key performance metrics of language models, including token generation throughput, Time to First Token (TTFT), and Inter Token Latency.
Deployment via Application Catalog
The simplest way to deploy the LLM Benchmarking Tool is through the Application Catalog:
- Navigate to the Application Catalog and select "Benchmark LLM performance"

- Fill in the required deployment information, including Name and Host endpoint for your benchmarking tool.

Model Configuration
Basic Configuration Parameters
Load Test Parameters
- Number of users (peak concurrency): Maximum number of concurrent users for the load test
- Ramp up (users started/second): Rate at which new users are added to the test
- Host: Base URL of the LLM API server exposing OpenAI-compatible endpoints (v1/chat/completions)
Model Settings
- Tokenizer (HuggingFace tokenizer to use to count tokens): HuggingFace tokenizer identifier used to count tokens in prompts and responses
- Model (Name of the model in chat/completions payload): Model identifier used in the API request payload
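
To make the mapping concrete, here is a minimal sketch (not part of the tool) of the kind of request these settings describe, assuming an OpenAI-compatible server. The host, model, and tokenizer values are placeholders taken from the examples later on this page, and the `requests`/`transformers` usage is illustrative.

```python
# Minimal sketch (not part of the tool) of the request these settings describe.
# Host, model name, and tokenizer below are placeholders; the benchmark itself
# drives many such requests concurrently.
import requests
from transformers import AutoTokenizer

host = "https://example-model.truefoundry.tech"        # Host (base URL)
model = "nousresearch-meta-llama-3-1-8b-instruct"      # Model name in the payload
tokenizer = AutoTokenizer.from_pretrained(             # Tokenizer used for counting
    "NousResearch/Meta-Llama-3.1-8B-Instruct"
)

prompt = "Summarize the benefits of load testing an LLM endpoint."
prompt_tokens = len(tokenizer.encode(prompt))

response = requests.post(
    f"{host}/v1/chat/completions",
    json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    },
    timeout=120,
)
completion = response.json()["choices"][0]["message"]["content"]
completion_tokens = len(tokenizer.encode(completion))
print(f"prompt tokens: {prompt_tokens}, completion tokens: {completion_tokens}")
```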

Finding Parameters for TrueFoundry Deployed Models
When using a model deployed on TrueFoundry, you can find the required parameters in the model's deployment spec:
- Model Name: Look under the `env` section for `MODEL_NAME`:

```yaml
env:
  MODEL_NAME: nousresearch-meta-llama-3-1-8b-instruct
```

- Host: Find the host URL under the `ports` section:

```yaml
ports:
  - host: example-model.truefoundry.tech
```

- Tokenizer: Look for the `model_id` under `artifacts_download.artifacts`:

```yaml
artifacts_download:
  artifacts:
    - type: huggingface-hub
      model_id: NousResearch/Meta-Llama-3.1-8B-Instruct
```
Finding Parameters for External Models
When using external models (like GPT-4), you'll need to configure the following parameters:
- Model Name: Use the model identifier from your provider (e.g., `gpt-4`, `gpt-4o`)
- Tokenizer: Find an equivalent tokenizer on HuggingFace (e.g., `Quivr/gpt-4o`)
- Host: Your external provider's API endpoint
- API Key: Your provider's API key (e.g., OpenAI API key)
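
As a rough illustration of how these four values fit together for an external provider, the snippet below assumes the OpenAI API and reads the key from an `OPENAI_API_KEY` environment variable (a placeholder, not a setting the tool defines); the stand-in HuggingFace tokenizer is only used to count tokens.

```python
# Illustrative only: how the four values above map onto a request to an
# external provider (here the OpenAI API). OPENAI_API_KEY is a placeholder
# environment variable, not a setting the tool defines.
import os
import requests
from transformers import AutoTokenizer

host = "https://api.openai.com"                            # provider endpoint
model = "gpt-4o"                                           # provider model identifier
api_key = os.environ["OPENAI_API_KEY"]                     # provider API key
tokenizer = AutoTokenizer.from_pretrained("Quivr/gpt-4o")  # equivalent HF tokenizer

resp = requests.post(
    f"{host}/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": model,
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
reply = resp.json()["choices"][0]["message"]["content"]
print(len(tokenizer.encode(reply)))  # token count via the stand-in tokenizer
```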
Finding Parameters for TrueFoundry LLM Gateway Models
When using models through TrueFoundry's LLM Gateway:
- Navigate to the LLM Gateway in your TrueFoundry workspace
- Select the model you want to benchmark
- Click on the `</> Code` button to view the API integration code
- From the code example, you can find:
  - Host: The base URL in the request (e.g., `https://truefoundry.tech/api/llm/api/inference/openai/chat/completions`)
  - Model Name: The model identifier in the request payload (e.g., `"model": "openai-main/gpt-4o"`)
  - OpenAI API Key: Generate one using the "Generate API Key" button
  - Tokenizer: Find an equivalent tokenizer on HuggingFace (e.g., `Quivr/gpt-4o`)
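
The request shown by the `</> Code` button roughly has the shape sketched below; the URL and model name are the examples from the list above, and `TFY_API_KEY` is a placeholder environment variable for the key you generate.

```python
# Rough shape of the request shown by the "</> Code" button; URL and model
# name are the examples above, and TFY_API_KEY is a placeholder for the key
# generated with "Generate API Key".
import os
import requests

url = "https://truefoundry.tech/api/llm/api/inference/openai/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['TFY_API_KEY']}"}
payload = {
    "model": "openai-main/gpt-4o",
    "messages": [{"role": "user", "content": "ping"}],
}
print(requests.post(url, headers=headers, json=payload, timeout=60).json())
```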
Prompt Configuration

- Max Output Tokens: Maximum number of tokens allowed in the model's response
- Use Random Prompts: Whether to use randomly generated prompts for testing
- Use Single Prompt: Whether to use a single prompt for all test requests
- Ignore EOS: Whether to ignore end-of-sequence tokens during token counting

- Prompt Min Tokens: Minimum number of tokens in the input prompt (not applicable when Use Random Prompts or Use Single Prompt is enabled)
- Prompt Max Tokens: Maximum number of tokens in the input prompt (not applicable when Use Random Prompts or Use Single Prompt is enabled)
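
For intuition, the sketch below shows one way a prompt could be constrained to a token-count range using a HuggingFace tokenizer, as the Prompt Min/Max Tokens settings describe; it is illustrative only and not the tool's actual prompt-generation logic, and the tokenizer name is an example.

```python
# Illustrative only (not the tool's implementation): constraining a prompt to a
# token-count range with a HuggingFace tokenizer, as the Prompt Min/Max Tokens
# settings describe. The tokenizer name is an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B-Instruct")

def clip_prompt(text: str, min_tokens: int, max_tokens: int) -> str:
    """Return text truncated to at most max_tokens; reject prompts below min_tokens."""
    ids = tokenizer.encode(text)
    if len(ids) < min_tokens:
        raise ValueError("prompt is shorter than the configured minimum")
    return tokenizer.decode(ids[:max_tokens])

print(clip_prompt("Explain the trade-offs of batching LLM requests. " * 40, 50, 128))
```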
Viewing Benchmark Results
After running the benchmark, you'll see comprehensive performance metrics displayed in charts:
- Requests per Second
- Active Users
- Tokens per Second
- Response Time Seconds
- Response Time First Token (ms)
- Inter Token Latency (ms)
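
For reference, TTFT and Inter Token Latency can be reproduced outside the tool from a streaming chat completions call. The sketch below treats each streamed chunk as one token (an approximation) and uses placeholder endpoint and model values.

```python
# Sketch of how TTFT and Inter Token Latency can be derived from a streaming
# chat completions call. Endpoint and model are placeholders, and each streamed
# chunk is treated as one token, which is an approximation.
import time
import requests

url = "https://example-model.truefoundry.tech/v1/chat/completions"
payload = {
    "model": "nousresearch-meta-llama-3-1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about latency."}],
    "stream": True,
    "max_tokens": 64,
}

start = time.perf_counter()
chunk_times = []
with requests.post(url, json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        # Server-sent events: each content chunk arrives as a "data: {...}" line.
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk_times.append(time.perf_counter())

ttft_ms = (chunk_times[0] - start) * 1000                                 # Time to First Token
gaps_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]  # inter-token gaps
print(f"TTFT: {ttft_ms:.1f} ms, mean ITL: {sum(gaps_ms) / len(gaps_ms):.1f} ms")
```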
