Using Images from NVIDIA NGC Container Registry

NVIDIA Container Registry (nvcr.io)

Create an NGC Personal Key

  1. Sign up at https://ngc.nvidia.com/
  2. Generate a Personal Key from https://org.ngc.nvidia.com/setup/personal-keys

Add nvcr.io as a Custom Docker Registry

  1. Under the Integrations tab, click + Add Integration Provider in the top right
  2. Select Custom Docker Registry under Integrations and enter the following:
    • Registry URL: nvcr.io
    • Username: $oauthtoken (this literal string is required by nvcr.io)
    • Password: Enter the Personal Key you created earlier
  3. Save

Use the Integration - Example: Deploying an NVIDIA NIM Container

📘 Save the Personal Key as a Secret

We recommend saving the generated key as a Secret on the platform so that it can be reused later, for example as an environment variable in deployments.
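
The NIM container we deploy below reads this key from the NGC_API_KEY environment variable, which can point to the saved Secret. The Secret FQN below is a placeholder; copy the actual FQN from the Secrets page:

env:
  NGC_API_KEY: tfy-secret://<tenant>:<secret-group>:NGC_API_KEY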

We can now deploy an NVIDIA NIM LLM container for inference. You can find the list of all supported models on the NIM docs page.

  1. We will pick the Llama 3.1 8B Instruct model as an example. From the list of supported models, click its NGC Catalog link

  2. From the Container page, copy the image tag (e.g. nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3, as used in the reference spec below)

  3. Next, start a new Service deployment on TrueFoundry

  • In the Image section, add the image URI we copied from the NGC page
  • Select the nvcr.io Docker Registry integration we added earlier
  • Enter 8000 for the port
  • Select a GPU
  4. Optionally add environment variables (see the Configuring NIM docs page)
  5. Submit

Here is the full spec for reference, targeting 2 x NVIDIA T4 GPUs:

name: nim-llama31-8b-ins-v03
type: service
image:
  type: image
  image_uri: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
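  # FQN of the nvcr.io Docker Registry integration added earlier; replace with your own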
  docker_registry: tenant:custom:nvcr:docker-registry:nvcr-truefoundry
ports:
  - host: <your-host>
    port: 8000
    expose: true
    protocol: TCP
    app_protocol: http
env:
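  # References the NGC Personal Key Secret saved earlier; NIM uses it to download model weights from NGC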
  NGC_API_KEY: tfy-secret://tenant:secret-group:NGC_API_KEY
  NIM_LOG_LEVEL: DEFAULT
  NIM_SERVER_PORT: '8000'
  NIM_JSONL_LOGGING: '1'
  NIM_MAX_MODEL_LEN: '4096'
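  # vllm-bf16-tp2 = vLLM backend, bf16 precision, tensor parallelism across the 2 GPUs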
  NIM_MODEL_PROFILE: vllm-bf16-tp2
  NIM_LOW_MEMORY_MODE: '1'
  NIM_SERVED_MODEL_NAME: llm
  NIM_TRUST_CUSTOM_CODE: '1'
  NIM_ENABLE_KV_CACHE_REUSE: '1'
  NIM_CACHE_PATH: /opt/nim/.cache
labels:
  tfy_model_server: vLLM
  tfy_openapi_path: openapi.json
  tfy_sticky_session_header_name: x-truefoundry-sticky-session-id
replicas: 1
resources:
  node:
    type: node_selector
    capacity_type: on_demand
  devices:
    - name: T4
      type: nvidia_gpu
      count: 2
  cpu_limit: 8
  cpu_request: 6
  memory_limit: 32000
  memory_request: 27200
  shared_memory_size: 24000
  ephemeral_storage_limit: 100000
  ephemeral_storage_request: 20000
workspace_fqn: <your-workspace-fqn>
readiness_probe:
  config:
    path: /v1/health/ready
    port: 8000
    type: http
  period_seconds: 10
  timeout_seconds: 1
  failure_threshold: 3
  success_threshold: 1
  initial_delay_seconds: 0
allow_interception: false
  6. Once deployed and ready, you can visit the /docs route on the endpoint to try it out

Model Caching using a Volume

To ensure fast startup, you can create a Read Write Many Volume in the same workspace and mount it at /opt/nim/.cache (the value of the NIM_CACHE_PATH environment variable) to cache the model weights, as sketched below.
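
A minimal sketch of the corresponding mounts section in the service spec, assuming the volume has already been created (the volume FQN is a placeholder, and the exact field names should be verified against the TrueFoundry Volumes docs):

mounts:
  # Mount the Read Write Many Volume at the NIM cache path so weights persist across restarts
  - type: volume
    mount_path: /opt/nim/.cache
    volume_fqn: <your-volume-fqn>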