Deploy LLMs using Nvidia TensorRT-LLM (TRT-LLM)

In this guide, we will deploy a Llama based LLM using Nvidia's TensorRT-LLM runtime. Specfically for this example we will deploy meta-llama/Llama-3.1-8B-Instruct with bf16 precision on 2 x L40S GPUs but you can choose whatever GPU configuration available to you

🚧

Quantization Support coming soon!

1. Add your Huggingface Token as a Secret

Since we are going to deploy the official Llama 3.1 8B Instruct model, we'd need a Huggingface Token that has access to the model.

Visit the model page and fill the access form. You'd get access to the model in 10-15 mins.

Now, Create a token following the HuggingFace Docs and Add it as a secret which we can use later

2. Create a ML Repo and a Workspace with access to ML Repo

Follow the docs at Creating a ML Repo to create a ML Repo backed by your Storage Integration. We will use this to store and version TRT-LLM engines.

Create a Workspace or use and existing Workspace and Grant it Editor access to the ML Repo

3. Deploy the Engine Builder Job

  1. Save the following YAML in a file called builder.truefoundry.yaml
name: trtllm-engine-builder-l40sx2
type: job
image:
  type: image
  image_uri: truefoundrycloud/trtllm-engine-builder:0.13.0
  command: >-
    python build.py
    --model-id {{model_id}}
    --truefoundry-ml-repo {{ml_repo}}
    --truefoundry-run-name {{run_name}}
    --model-type {{model_type}}
    --dtype {{dtype}}
    --max-batch-size {{max_batch_size}}
    --max-input-len {{max_input_len}}
    --max-num-tokens {{max_num_tokens}}
    --kv-cache-type {{kv_cache_type}}
    --workers {{workers}}
    --gpt-attention-plugin {{gpt_attention_plugin}}
    --gemm-plugin {{gemm_plugin}}
    --nccl-plugin {{nccl_plugin}}
    --context-fmha {{context_fmha}}
    --remove-input-padding {{remove_input_padding}}
    --reduce-fusion {{reduce_fusion}}
    --enable-xqa {{enable_xqa}}
    --use-paged-context-fmha {{use_paged_context_fmha}}
    --multiple-profiles {{multiple_profiles}}
    --paged-state {{paged_state}}
    --use-fused-mlp {{use_fused_mlp}}
    --tokens-per-block {{tokens_per_block}}
    --delete-hf-model-post-conversion
trigger:
  type: manual
env:
  PYTHONUNBUFFERED: '1'
  HF_TOKEN: "YOUR-HF-TOKEN-SECRET-FQN"
params:
  - name: model_id
    default: meta-llama/Meta-Llama-3.1-8B-Instruct
    param_type: string
  - name: ml_repo
    default: trtllm-models
    param_type: ml_repo
  - name: run_name
    param_type: string
  - name: model_type
    default: llama
    param_type: string
  - name: dtype
    default: bfloat16
    param_type: string
  - name: max_batch_size
    default: "256"
    param_type: string
  - name: max_input_len
    default: "4096"
    param_type: string
  - name: max_num_tokens
    default: "8192"
    param_type: string
  - name: kv_cache_type
    default: paged
    param_type: string
  - name: workers
    default: "2"
    param_type: string
  - name: gpt_attention_plugin
    default: auto
    param_type: string
  - name: gemm_plugin
    default: auto
    param_type: string
  - name: nccl_plugin
    default: auto
    param_type: string
  - name: context_fmha
    default: enable
    param_type: string
  - name: remove_input_padding
    default: enable
    param_type: string
  - name: reduce_fusion
    default: enable
    param_type: string
  - name: enable_xqa
    default: enable
    param_type: string
  - name: use_paged_context_fmha
    default: enable
    param_type: string
  - name: multiple_profiles
    default: enable
    param_type: string
  - name: paged_state
    default: enable
    param_type: string
  - name: use_fused_mlp
    default: disable
    param_type: string
  - name: tokens_per_block
    default: "64"
    param_type: string
retries: 0
resources:
  node:
    type: node_selector
    capacity_type: on_demand
  devices:
    - type: nvidia_gpu
      count: 2
      name: L40S
  cpu_request: 6
  cpu_limit: 8
  memory_request: 54400
  memory_limit: 64000
  ephemeral_storage_request: 70000
  ephemeral_storage_limit: 100000

🚧

resources section varies across cloud provider

Based on your cloud provider, the available gpu type and nodepools will be different, you'd need to adjust it before deploying.

  1. Replace YOUR-HF-TOKEN-SECRET-FQN with the Secret FQN we created at the beginning. E.g.
env:
-  HF_TOKEN: YOUR-HF-TOKEN-SECRET-FQN
+  HF_TOKEN: tfy-secret://truefoundry:chirag-personal:HF_TOKEN
  1. Generating correct resources section your configuration.

Start a New Deployment, scroll to Resources section and select the GPU type and Count and click on the Spec button

You can copy this resources section and replace it in builder.truefoundry.yaml

  1. After Setting up CLI, deploy the job by mentioning the Workspace FQN
tfy deploy --file builder.truefoundry.yaml --workspace-fqn <your-workspace-fqn> --no-wait
  1. Once the Job is deployed, Trigger it

Enter a Run Name. You can modify other arguments if needed. Please refer to TRT-LLM Docs to know more about these parameters. Click Submit

  1. When the run finishes, you'd have the tokenizer and engine ready to use under the Run Details section

Engine is available under Models section. Copy the FQN and keep it handy.

Tokenizer is available under Artifacts section. Copy the FQN and keep it handy.

4. Deploy with Nvidia Triton Server

Finally, let's deploy the engine using Nvidia Triton Server as a TrueFoundry Service. Here is the spec in full:

🚧

resources section varies across cloud provider

Based on your cloud provider, the available gpu type and nodepools will be different, you'd need to adjust it before deploying.

name: llama-3-1-8b-instruct-trt-llm
type: service
env:
  DECOUPLED_MODE: 'True'
  BATCHING_STRATEGY: inflight_fused_batching
  ENABLE_KV_CACHE_REUSE: 'True'
  TRITON_MAX_BATCH_SIZE: '64'
  BATCH_SCHEDULER_POLICY: max_utilization
  ENABLE_CHUNKED_CONTEXT: 'True'
  KV_CACHE_FREE_GPU_MEM_FRACTION: '0.95'
image:
  type: image
  image_uri: docker.io/truefoundrycloud/tritonserver:24.09-trtllm-python-py3
ports:
  - port: 8000
    expose: false
    protocol: TCP
    app_protocol: http
labels:
  tfy_openapi_path: openapi.json
replicas: 1
resources:
  node:
    type: node_selector
    capacity_type: on_demand
  devices:
    - name: L40S
      type: nvidia_gpu
      count: 2
  cpu_limit: 8
  cpu_request: 6
  memory_limit: 64000
  memory_request: 54400
  shared_memory_size: 12000
  ephemeral_storage_limit: 100000
  ephemeral_storage_request: 20000
liveness_probe:
  config:
    path: /health/ready
    port: 8000
    type: http
  period_seconds: 10
  timeout_seconds: 1
  failure_threshold: 8
  success_threshold: 1
  initial_delay_seconds: 15
readiness_probe:
  config:
    path: /health/ready
    port: 8000
    type: http
  period_seconds: 10
  timeout_seconds: 1
  failure_threshold: 5
  success_threshold: 1
  initial_delay_seconds: 15
artifacts_download:
  artifacts:
    - type: truefoundry-artifact
      artifact_version_fqn: artifact:truefoundry/trtllm-models/tokenizer-meta-llama-3-1-instruct-l40sx2:1
      download_path_env_variable: TOKENIZER_DIR
    - type: truefoundry-artifact
      artifact_version_fqn: model:truefoundry/trtllm-models/trt-llm-engine-meta-llama-3-1-instruct-l40sx2:1
      download_path_env_variable: ENGINE_DIR
  1. Adjust the resources section like we did for the builder Job

❗️

GPU configuration must be same as the Builder Job

Since TRT-LLM optimizes the model for the target GPU type and counts - it is important that the GPU type and count matches while deploying.

  1. In artifacts_download, you'd need to change artifact_version_fqn to the tokenizer and engine obtained at the end of the Job Run from previous section
  2. Deploy the Service by mentioning the Workspace FQN
tfy deploy --file server.truefoundry.yaml --workspace-fqn <your-workspace-fqn> --no-wait
  1. Once deployed, we'll make some final adjustments by editing the Service

    • From Ports section Enable Expose and configure Endpoint as needed


5. Run Inferences

You can now send a payload like follows to /v1/chat/completions Endpoint via the OpenAPI tab

{
  "model": "ensemble", 
  "messages": [
    {"role": "user", "content": "Describe the moon in 100 words"}
  ],
  "max_tokens": 128
}

Since the endpoint is OpenAI compatible, you can Add it to LLM Gateway or use ir directly with OpenAI SDK