Fractional GPUs enable us to allocate multiple workloads to a single GPU, which can be useful in the following scenarios:

  1. The workloads take around 2-3 GB of VRAM, so you can allocate multiple replicas of this workload on a single GPU that has around 16GB of VRAM or more.
  2. Each workload has sparse traffic and is not able to max out the GPU usage.

There are two ways to use fractional GPUs:

  1. TimeSlicing: In this approach, you slice a GPU into a fixed number of fractional parts and then choose a fraction of the GPU for each workload. For example, we can decide to divide a GPU into 10 slices and then request 3 slices for workload1, 5 slices for workload2 and 2 slices for workload3. This means that workload1 will use 0.3 GPU (compute + 30% of the VRAM), workload2 will use 0.5 GPU and workload3 will use 0.2 GPU. However, timeslicing is only used for scheduling the workloads on the same GPU - it doesn't provide any actual isolation on the machine. For example, if the GPU has 16GB of VRAM, it is up to the user to make sure that workload1 takes less than 4.8GB of VRAM, workload2 takes less than 8GB and workload3 takes less than 3.2GB. If one workload suddenly starts taking more memory, it can crash the other processes. Compute is also shared, but one workload can use the complete GPU if the other workloads are idle - it is basically context-switching among the three workloads. You can read more about this here.
  2. MIG (Multi-Instance GPUs): This is a feature provided by NVIDIA only on the A100, H100 and newer generation GPUs - it doesn't work on other GPUs. We can divide the GPU into a fixed number of configurable parts, as shown in the MIG profile table in the MIG deployment section below. A workload can choose one of these slices and gets both compute and memory isolation. The instances are not arbitrary fractions of the GPU but discrete units. For example, let's say we divide one 40GB A100 GPU into 7 parts - then we can place 7 workloads, each using around 1/7 of the compute and 5GB of VRAM. Please note that we cannot simply give 2 slices to one workload and expect it to get 2/7 of the GPU and 10GB of VRAM; each workload can only get one slice in this case. A rough sketch of how these two modes surface as Kubernetes resources is shown right after this list.
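To make this concrete, here is a minimal sketch (not TrueFoundry-specific) of how the two modes typically surface as Kubernetes resource requests, assuming the NVIDIA device plugin advertises 10 time-sliced replicas per physical GPU and MIG slices are exposed with the mixed strategy; the pod names and image tag are placeholders:

```shell
# Sketch only: pod names and image tag are illustrative placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: timesliced-demo            # hypothetical name
spec:
  containers:
  - name: main
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 3          # 3 of the 10 time-sliced shares (~0.3 GPU)
---
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo                   # hypothetical name
spec:
  containers:
  - name: main
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice (mixed strategy resource name)
EOF
```

In the time-sliced case the 3 shares are only a scheduling convention, while the 1g.5gb request is backed by a hardware-isolated MIG instance.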

A brief comparison chart between TimeSlicing and MIG is as follows:

| Feature | TimeSlicing | MIG (Multi-Instance GPU) |
| --- | --- | --- |
| GPU Support | Works on most GPUs. | Only supported on NVIDIA A100, H100 and newer data-center GPUs. |
| Isolation | No real isolation. The user is responsible for memory management; one workload exceeding its allocated memory can crash the others. | Strong isolation. Compute and memory are isolated between instances, with guaranteed resource allocation. |
| Resource Allocation | Divides the GPU into fractional parts (e.g., 0.3, 0.5, 0.2) that workloads can request. | Divides the GPU into pre-defined, discrete instance types (as per NVIDIA's configurations). Workloads are assigned entire instances. |
| VRAM Management | User-managed. VRAM allocation is not enforced by the hardware. | Hardware-enforced. Each instance has dedicated VRAM. |
| Compute Sharing | Compute is shared via context-switching. Workloads can potentially use the entire GPU when others are idle. | Compute is partitioned and isolated per instance. No sharing of compute resources beyond the instance's allocation. |
| Flexibility | More flexible resource fractions (e.g., can request 0.3, 0.5, etc.). | Limited to NVIDIA's pre-defined instance types. Less flexible for fine-grained resource requests. |

Deploying on Fractional GPUs via MIG

Step 1: Create a Nodepool with MIG enabled

We will need to create a separate nodepool for MIG-enabled GPUs. Every GPU has different MIG profiles, as described on this page: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html. For example, here are the MIG profiles for the A100 GPU:

| GPU | GPU Compute Fraction / Instance | Number of instances per GPU | GPU Memory / Instance | Configuration Name | GPU Instance Profile (for Azure) |
| --- | --- | --- | --- | --- | --- |
| A100 (40GB) | 1/7 | 7 | 5GB | 1g.5gb | MIG1g |
| A100 (40GB) | 2/7 | 3 | 10GB | 2g.10gb | MIG2g |
| A100 (40GB) | 3/7 | 2 | 20GB | 3g.20gb | MIG3g |
| A100 (80GB) | 1/7 | 7 | 10GB | 1g.10gb | MIG1g |
| A100 (80GB) | 2/7 | 3 | 20GB | 2g.20gb | MIG2g |
| A100 (80GB) | 3/7 | 2 | 40GB | 3g.40gb | MIG3g |
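For context, if you have shell access to a MIG-capable node you can inspect and manually carve these profiles with nvidia-smi. This is a hedged sketch for illustration only - on AKS, the --gpu-instance-profile flag used below takes care of this for you:

```shell
# List the MIG profiles this GPU supports (names such as 1g.5gb, 2g.10gb, 3g.20gb).
nvidia-smi mig -lgip

# Enable MIG mode on GPU 0 (a GPU reset or reboot may be required before it takes effect).
sudo nvidia-smi -i 0 -mig 1

# Create one 1g.5gb GPU instance plus its compute instance
# (repeat the profile name in the comma-separated list to create more).
sudo nvidia-smi mig -cgi 1g.5gb -C

# List the GPU instances that now exist.
sudo nvidia-smi mig -lgi
```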

While creating the nodepool, we will need to select the MIG profile. Here are the steps to do it in different cloud providers:

Create a Nodepool with MIG enabled using the --gpu-instance-profile argument of the Azure CLI.

```shell
az aks nodepool add \
--cluster-name <your cluster name> \
--resource-group <your resource group> \
--no-wait \
--enable-cluster-autoscaler \
--eviction-policy Delete \
--node-count 0 \
--max-count 20  \
--min-count 1 \
--node-osdisk-size 200 \
--scale-down-mode Delete \
--os-type Linux \
--node-taints "nvidia.com/gpu=Present:NoSchedule" \
--name a100mig7 \
--node-vm-size Standard_NC24ads_A100_v4 \
--priority Spot \
--os-sku Ubuntu \
--gpu-instance-profile MIG1g
```
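After the nodepool is created, you can optionally confirm that the MIG profile was applied. This is a hedged check using the Azure CLI, assuming the nodepool resource exposes a gpuInstanceProfile field:

```shell
# Should print MIG1g for the nodepool created above.
az aks nodepool show \
  --cluster-name <your cluster name> \
  --resource-group <your resource group> \
  --name a100mig7 \
  --query gpuInstanceProfile \
  --output tsv
```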
Step 2: Deploy your workload on the MIG nodepool

Once you have created the nodepool, you will be able to see the MIG nodepool among the available nodepools in the Resources section of the deployment.

It might take around 10 minutes for the newly created nodepool to become visible on the TrueFoundry UI. You can force-sync the nodepools by going to Platform -> Clusters -> Sync Cluster.

To deploy a workload that uses a fractional GPU, start deploying your service/job on TrueFoundry and, in the “Resources” section, choose the nodepool selector. You can now see the fractional (MIG) GPUs on the UI and select one (as shown below).

Using MIG GPU
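You can also verify from kubectl that the nodes in the MIG nodepool advertise the MIG slices as resources. This is a hedged sketch assuming the mixed MIG strategy; the node name is a placeholder:

```shell
# Nodes from the MIG nodepool should advertise the MIG slices as extended resources,
# e.g. nvidia.com/mig-1g.5gb with a count of 7 for the MIG1g profile.
kubectl describe node <node-from-the-mig-nodepool> | grep -i "nvidia.com/"
```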

Deploying on Fractional GPUs via Timeslicing

Step 1: Create a Nodepool with Timeslicing enabled

Create a nodepool with the nvidia.com/device-plugin.config label pointing to the correct time-slicing config using the Azure CLI.

```shell
az aks nodepool add \
--cluster-name <your cluster name> \
--resource-group <your resource group> \
--no-wait \
--enable-cluster-autoscaler \
--eviction-policy Delete \
--node-count 0 \
--max-count 20  \
--min-count 0 \
--node-osdisk-size 200 \
--scale-down-mode Delete \
--os-type Linux \
--node-taints "nvidia.com/gpu=Present:NoSchedule" \
--name a100ts10 \
--node-vm-size Standard_NC24ads_A100_v4 \
--priority Spot \
--os-sku Ubuntu \
--labels nvidia.com/device-plugin.config=time-sliced-10
```
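The nvidia.com/device-plugin.config=time-sliced-10 label selects a named entry in the NVIDIA device plugin (GPU Operator) configuration. For reference, here is a hedged sketch of what a config entry named time-sliced-10 (10 replicas per GPU) typically looks like; the ConfigMap name and namespace are assumptions, and on TrueFoundry-managed clusters this config is usually provisioned for you already:

```shell
# Illustrative only: the ConfigMap name "time-slicing-config" and the "gpu-operator"
# namespace are assumptions; the label above selects the "time-sliced-10" key within it.
kubectl create configmap time-slicing-config \
  --namespace gpu-operator \
  --from-literal=time-sliced-10="$(cat <<'EOF'
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
)"
```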
Step 2: Deploy your workload on the Timeslicing nodepool

Once you have created the nodepool, you will be able to see the timesliced nodepool among the available nodepools in the Resources section of the deployment.

It might take around 10 minutes for the newly created nodepool to become visible on the TrueFoundry UI. You can force-sync the nodepools by going to Platform -> Clusters -> Sync Cluster.

To deploy a workload that uses a fractional GPU, start deploying your service/job on TrueFoundry and, in the “Resources” section, choose the nodepool selector. You can now see the timesliced GPUs on the UI and select one (as shown below).

Using Timeslicing GPU
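As with MIG, you can optionally verify from kubectl that the time-sliced shares are advertised on the new nodes; the node name below is a placeholder:

```shell
# A node from the time-sliced nodepool should report allocatable nvidia.com/gpu: 10
# per physical GPU (10 shares), rather than 1.
kubectl get node <node-from-the-timesliced-nodepool> \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```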
