Deploy LLMs using Nvidia TensorRT-LLM (TRT-LLM)
In this guide, we will deploy a Llama-based LLM using Nvidia's TensorRT-LLM runtime. Specifically, we will deploy meta-llama/Llama-3.1-8B-Instruct with bf16 precision on 2 x L40S GPUs, but you can choose whatever GPU configuration is available to you.
Quantization support is coming soon!
1. Add your Hugging Face Token as a Secret
Since we are going to deploy the official Llama 3.1 8B Instruct model, we need a Hugging Face token that has access to the model.
Visit the model page and fill out the access form. You should get access to the model within 10-15 minutes.
Then, create a token following the Hugging Face docs and add it as a Secret, which we can use later.
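Optionally, before kicking off the builder Job, you can verify that your token can actually see the gated model. Here is a minimal sketch using the `huggingface_hub` library (the token value is a placeholder):

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError

# Placeholder token -- substitute the token you just created.
api = HfApi(token="hf_xxx")

try:
    # Raises GatedRepoError if the token cannot access the gated repository.
    info = api.model_info("meta-llama/Llama-3.1-8B-Instruct")
    print("Access OK:", info.id)
except GatedRepoError:
    print("Token does not yet have access to the gated model.")
```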
2. Create an ML Repo and a Workspace with access to the ML Repo
Follow the docs at Creating a ML Repo to create an ML Repo backed by your Storage Integration. We will use this to store and version TRT-LLM engines.
Create a Workspace or use an existing one, and grant it Editor access to the ML Repo.
3. Deploy the Engine Builder Job
- Save the following YAML in a file called `builder.truefoundry.yaml`:
```yaml
name: trtllm-engine-builder-l40sx2
type: job
image:
  type: image
  image_uri: truefoundrycloud/trtllm-engine-builder:0.13.0
  command: >-
    python build.py
    --model-id {{model_id}}
    --truefoundry-ml-repo {{ml_repo}}
    --truefoundry-run-name {{run_name}}
    --model-type {{model_type}}
    --dtype {{dtype}}
    --max-batch-size {{max_batch_size}}
    --max-input-len {{max_input_len}}
    --max-num-tokens {{max_num_tokens}}
    --kv-cache-type {{kv_cache_type}}
    --workers {{workers}}
    --gpt-attention-plugin {{gpt_attention_plugin}}
    --gemm-plugin {{gemm_plugin}}
    --nccl-plugin {{nccl_plugin}}
    --context-fmha {{context_fmha}}
    --remove-input-padding {{remove_input_padding}}
    --reduce-fusion {{reduce_fusion}}
    --enable-xqa {{enable_xqa}}
    --use-paged-context-fmha {{use_paged_context_fmha}}
    --multiple-profiles {{multiple_profiles}}
    --paged-state {{paged_state}}
    --use-fused-mlp {{use_fused_mlp}}
    --tokens-per-block {{tokens_per_block}}
    --delete-hf-model-post-conversion
trigger:
  type: manual
env:
  PYTHONUNBUFFERED: '1'
  HF_TOKEN: "YOUR-HF-TOKEN-SECRET-FQN"
params:
  - name: model_id
    default: meta-llama/Meta-Llama-3.1-8B-Instruct
    param_type: string
  - name: ml_repo
    default: trtllm-models
    param_type: ml_repo
  - name: run_name
    param_type: string
  - name: model_type
    default: llama
    param_type: string
  - name: dtype
    default: bfloat16
    param_type: string
  - name: max_batch_size
    default: "256"
    param_type: string
  - name: max_input_len
    default: "4096"
    param_type: string
  - name: max_num_tokens
    default: "8192"
    param_type: string
  - name: kv_cache_type
    default: paged
    param_type: string
  - name: workers
    default: "2"
    param_type: string
  - name: gpt_attention_plugin
    default: auto
    param_type: string
  - name: gemm_plugin
    default: auto
    param_type: string
  - name: nccl_plugin
    default: auto
    param_type: string
  - name: context_fmha
    default: enable
    param_type: string
  - name: remove_input_padding
    default: enable
    param_type: string
  - name: reduce_fusion
    default: enable
    param_type: string
  - name: enable_xqa
    default: enable
    param_type: string
  - name: use_paged_context_fmha
    default: enable
    param_type: string
  - name: multiple_profiles
    default: enable
    param_type: string
  - name: paged_state
    default: enable
    param_type: string
  - name: use_fused_mlp
    default: disable
    param_type: string
  - name: tokens_per_block
    default: "64"
    param_type: string
retries: 0
resources:
  node:
    type: node_selector
    capacity_type: on_demand
  devices:
    - type: nvidia_gpu
      count: 2
      name: L40S
  cpu_request: 6
  cpu_limit: 8
  memory_request: 54400
  memory_limit: 64000
  ephemeral_storage_request: 70000
  ephemeral_storage_limit: 100000
```
The `resources` section varies across cloud providers. Based on your cloud provider, the available GPU types and nodepools will differ, so you'd need to adjust it before deploying.
- Replace `YOUR-HF-TOKEN-SECRET-FQN` with the FQN of the Secret we created at the beginning. E.g.

```diff
 env:
-  HF_TOKEN: YOUR-HF-TOKEN-SECRET-FQN
+  HF_TOKEN: tfy-secret://truefoundry:chirag-personal:HF_TOKEN
```
- Generating the correct `resources` section for your configuration:
  Start a New Deployment, scroll to the Resources section, select the GPU type and count, and click on the Spec button.
  You can copy this `resources` section and replace it in `builder.truefoundry.yaml`.
- After setting up the CLI, deploy the Job by mentioning the Workspace FQN:

```shell
tfy deploy --file builder.truefoundry.yaml --workspace-fqn <your-workspace-fqn> --no-wait
```
- Once the Job is deployed, trigger it. Enter a Run Name; you can modify the other arguments if needed. Please refer to the TRT-LLM Docs to learn more about these parameters. Click Submit.
- When the run finishes, you'd have the `tokenizer` and `engine` ready to use under the Run Details section.
  - The engine is available under the Models section. Copy its FQN and keep it handy.
  - The tokenizer is available under the Artifacts section. Copy its FQN and keep it handy.
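If you prefer to look these up programmatically, here is a hypothetical sketch using the TrueFoundry ML client. The `truefoundry.ml` module and the `get_model_version_by_fqn` / `get_artifact_version_by_fqn` calls are assumptions based on the SDK's lineage; verify them against your installed SDK version, and substitute your own FQNs:

```python
from truefoundry.ml import get_client  # assumed module path; check your SDK

client = get_client()

# Example FQNs -- replace with the ones from your Run Details page.
engine = client.get_model_version_by_fqn(
    "model:truefoundry/trtllm-models/trt-llm-engine-meta-llama-3-1-instruct-l40sx2:1"
)
tokenizer = client.get_artifact_version_by_fqn(
    "artifact:truefoundry/trtllm-models/tokenizer-meta-llama-3-1-instruct-l40sx2:1"
)
print(engine.fqn, tokenizer.fqn)
```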
4. Deploy with Nvidia Triton Server
Finally, let's deploy the engine using Nvidia Triton Server as a TrueFoundry Service. Here is the spec in full; save it in a file called `server.truefoundry.yaml`:
Note: the `resources` section varies across cloud providers. Based on your cloud provider, the available GPU types and nodepools will differ, so you'd need to adjust it before deploying.
```yaml
name: llama-3-1-8b-instruct-trt-llm
type: service
env:
  DECOUPLED_MODE: 'True'
  BATCHING_STRATEGY: inflight_fused_batching
  ENABLE_KV_CACHE_REUSE: 'True'
  TRITON_MAX_BATCH_SIZE: '64'
  BATCH_SCHEDULER_POLICY: max_utilization
  ENABLE_CHUNKED_CONTEXT: 'True'
  KV_CACHE_FREE_GPU_MEM_FRACTION: '0.95'
image:
  type: image
  image_uri: docker.io/truefoundrycloud/tritonserver:24.09-trtllm-python-py3
ports:
  - port: 8000
    expose: false
    protocol: TCP
    app_protocol: http
labels:
  tfy_openapi_path: openapi.json
replicas: 1
resources:
  node:
    type: node_selector
    capacity_type: on_demand
  devices:
    - name: L40S
      type: nvidia_gpu
      count: 2
  cpu_limit: 8
  cpu_request: 6
  memory_limit: 64000
  memory_request: 54400
  shared_memory_size: 12000
  ephemeral_storage_limit: 100000
  ephemeral_storage_request: 20000
liveness_probe:
  config:
    path: /health/ready
    port: 8000
    type: http
  period_seconds: 10
  timeout_seconds: 1
  failure_threshold: 8
  success_threshold: 1
  initial_delay_seconds: 15
readiness_probe:
  config:
    path: /health/ready
    port: 8000
    type: http
  period_seconds: 10
  timeout_seconds: 1
  failure_threshold: 5
  success_threshold: 1
  initial_delay_seconds: 15
artifacts_download:
  artifacts:
    - type: truefoundry-artifact
      artifact_version_fqn: artifact:truefoundry/trtllm-models/tokenizer-meta-llama-3-1-instruct-l40sx2:1
      download_path_env_variable: TOKENIZER_DIR
    - type: truefoundry-artifact
      artifact_version_fqn: model:truefoundry/trtllm-models/trt-llm-engine-meta-llama-3-1-instruct-l40sx2:1
      download_path_env_variable: ENGINE_DIR
```
- Adjust the `resources` section like we did for the builder Job.
  The GPU configuration must be the same as the builder Job's: since TRT-LLM optimizes the model for the target GPU type and count, it is important that the GPU type and count match while deploying.
- In `artifacts_download`, you'd need to change the `artifact_version_fqn` values to the tokenizer and engine FQNs obtained at the end of the Job Run in the previous section.
- Deploy the Service by mentioning the Workspace FQN:
```shell
tfy deploy --file server.truefoundry.yaml --workspace-fqn <your-workspace-fqn> --no-wait
```
- Once deployed, we'll make some final adjustments by editing the Service:
  - From the Ports section, enable Expose and configure the Endpoint as needed.
  - (Optional) Configure Download Models and Artifacts to prevent re-downloads.
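Once the endpoint is exposed, you can sanity-check the server using the same Triton health route the probes in the spec above rely on. A minimal sketch with the `requests` library (the endpoint URL is a placeholder):

```python
import requests

# Placeholder endpoint -- use the Endpoint you configured for the Service.
BASE_URL = "https://your-service-endpoint.example.com"

# Same route as the liveness/readiness probes in the spec above.
resp = requests.get(f"{BASE_URL}/health/ready", timeout=5)
print(resp.status_code)  # 200 means the Triton server is ready
```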
5. Run Inferences
You can now send a payload like the following to the `/v1/chat/completions` endpoint via the OpenAPI tab:
```json
{
  "model": "ensemble",
  "messages": [
    {"role": "user", "content": "Describe the moon in 100 words"}
  ],
  "max_tokens": 128
}
```
Since the endpoint is OpenAI compatible, you can add it to the LLM Gateway or use it directly with the OpenAI SDK.
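For example, here is a minimal sketch using the OpenAI Python SDK; the base URL and API key are placeholders, so point them at your Service endpoint and credentials:

```python
from openai import OpenAI

# Placeholder values -- use your Service endpoint and credentials.
client = OpenAI(
    base_url="https://your-service-endpoint.example.com/v1",
    api_key="dummy",
)

response = client.chat.completions.create(
    model="ensemble",  # the model name served by the Triton ensemble
    messages=[{"role": "user", "content": "Describe the moon in 100 words"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```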