System Node pool
If you followed the steps in Creating the AKS cluster, a system node pool has already been created. It runs on on-demand nodes and hosts the applications that are required to power the platform. These include:
- Argocd
- Argo rollouts
- Istio
- tfy-agent
One on-demand node pool is always required
By default, the primary node pool of AKS (the system node pool) should always be on-demand. It is not possible to create a spot node pool as the initial pool.
User based node pools
User based node pools are used to run user applications. These pools can be of type `on-demand` or `spot`.
Spot node pool
Spot node pools host user workloads that can tolerate interruptions. Because spot instances are drawn from the cloud provider's spare capacity, they can bring significant savings on the cloud bill. This makes them a good fit for certain applications, dev workloads and unimportant job runs.
Creating a spot CPU node pool
A spot CPU node pool should be used where the applications can tolerate significant interruptions. By default, TrueFoundry can tolerate interruptions on the following applications that support the platform:
- Prometheus
- Loki
- cert-manager
- argo-workflows
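
As a rough sketch, a spot CPU pool can be added with the Azure CLI command `az aks nodepool add`. The resource group, cluster name, pool name, VM size and autoscaler bounds below are placeholder assumptions rather than values from this guide. The command is wrapped in a function so that nothing is created until you call it.

```shell
# Placeholder values (assumptions); replace them with your own.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# Adds a user node pool backed by spot VMs.
create_spot_cpu_pool() {
  # --spot-max-price -1 caps the price at the current on-demand price.
  # --eviction-policy Delete removes evicted VMs instead of deallocating them.
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name spotcpu \
    --mode User \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-vm-size Standard_D2s_v5 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
}
```

Once the values are correct, run `create_spot_cpu_pool` in a shell that is logged in to Azure.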
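Spot node pools carry the taint `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, which means every pod that you want deployed on these spot instances must tolerate this taint. The toleration must match the key `kubernetes.azure.com/scalesetpriority`, the value `spot` and the effect `NoSchedule`.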
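
A minimal sketch of such a toleration, written to a patch file for a Deployment's pod template. The file name and the Deployment name `my-app` are hypothetical. Note that the patch sets the whole `tolerations` list, so include any tolerations the workload already needs.

```shell
# Write a patch that adds the spot toleration to a Deployment's pod template.
cat > spot-toleration-patch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
EOF

# Optional: apply the patch to a hypothetical Deployment named my-app.
# kubectl patch deployment my-app --patch-file spot-toleration-patch.yaml
```

The toleration only allows pods onto spot nodes; it does not force them there. If a workload must run on spot nodes, a node selector or node affinity is needed as well.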
Creating a spot GPU node pool
Creating a spot GPU node pool is similar to creating a spot CPU node pool, except for two things:
- Select the right instance size for the GPU workload, and make sure you have the required GPU instance quota in your region.
- Add the taint `nvidia.com/gpu=Present:NoSchedule`. This taint prevents non-GPU workloads from being deployed on GPU machines.
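
The two differences above can be sketched as follows. The resource group, cluster name, pool name and autoscaler bounds are placeholder assumptions, and `Standard_NC6` is used only because it appears in the pricing table of this guide; pick a GPU size that is available and within quota in your region.

```shell
# Placeholder values (assumptions); replace them with your own.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# Adds a spot user node pool with a GPU VM size and the GPU taint.
create_spot_gpu_pool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name spotgpu \
    --mode User \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-vm-size Standard_NC6 \
    --node-taints "nvidia.com/gpu=Present:NoSchedule" \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 2
}
```

Pods scheduled on this pool need tolerations for both the spot taint and the GPU taint.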
On-demand or Regular node pools
On-demand (regular) node pools are used for applications that require a dedicated machine. These workloads are important, or nearly so, and need to run at the required time. On-demand nodes are generally more expensive than their spot counterparts but come with an SLA. As a general rule, on-demand nodes do not suffer downtime or interruptions. However, nodes are ephemeral by nature and can still go down when resource utilisation exceeds its thresholds.
Creating an on-demand CPU node pool
Add the node pool to your cluster with `az aks nodepool add`, selecting the right instance size for your workloads.
Creating an on-demand GPU node pool
Creating an on-demand GPU node pool is similar to creating an on-demand CPU node pool, except for two things:
- Select the right instance size for the GPU workload, and make sure you have the required GPU instance quota in your region.
- Add the taint `nvidia.com/gpu=Present:NoSchedule`. This taint prevents non-GPU workloads from being deployed on GPU machines.
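
The on-demand CPU and GPU pools described above can be sketched with one helper, since they differ only in VM size and the optional GPU taint. The function name, resource group, cluster name and autoscaler bounds are placeholder assumptions. `--priority Regular` is the Azure CLI value for on-demand nodes.

```shell
# Placeholder values (assumptions); replace them with your own.
RESOURCE_GROUP="my-resource-group"
CLUSTER_NAME="my-aks-cluster"

# Usage: create_ondemand_pool <pool-name> <vm-size> [node-taint]
create_ondemand_pool() {
  pool_name="$1"
  vm_size="$2"
  node_taint="${3:-}"

  # Collect the arguments shared by CPU and GPU pools.
  set -- \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$pool_name" \
    --mode User \
    --priority Regular \
    --node-vm-size "$vm_size" \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3

  # Add the taint only when one was passed (GPU pools).
  if [ -n "$node_taint" ]; then
    set -- "$@" --node-taints "$node_taint"
  fi

  az aks nodepool add "$@"
}
```

For example, `create_ondemand_pool ondemandcpu Standard_D2s_v5` would add a CPU pool, and `create_ondemand_pool ondemandgpu Standard_NC6 "nvidia.com/gpu=Present:NoSchedule"` would add a tainted GPU pool. Both VM sizes are taken from the pricing table in this guide and should be replaced with sizes that fit your workloads.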
Adding node pools in the platform
By default, TrueFoundry requires an Azure AD application that has `Reader` access on the AKS cluster to sync all the node pools. If this Azure AD application is already added to the platform, new node pools will sync automatically.
Understanding cost implications of spot and on-demand (regular) node pools
A large difference can be observed when analysing cloud costs for spot and on-demand nodes: in the sample below, spot nodes are roughly 88% to 90% cheaper. However, this saving comes with uncertainty about node uptime, which is the trade-off. Below is a sample of nodes running for a month, comparing the costs of spot and on-demand machines. All prices are for the South Central US region.

Priority | Instance type | Compute | GPU | Cost (per month) |
---|---|---|---|---|
Spot | Standard_D2s_v5 | 2vCPU/8 GB RAM | False | $ 10.19 |
On-demand | Standard_D2s_v5 | 2vCPU/8 GB RAM | False | $ 83.95 |
Spot | Standard_NC6 | 6vCPU/56 GB RAM | True | $ 78.84 |
On-demand | Standard_NC6 | 6vCPU/56 GB RAM | True | $ 788.40 |
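
To make the comparison concrete, the relative saving for each instance type in the table can be worked out as `1 - spot / on-demand`. The short sketch below uses only the prices listed above; the function name is illustrative.

```shell
# Prints the relative monthly saving of spot over on-demand for each row.
# Input columns: instance type | spot price | on-demand price (USD per month).
spot_savings() {
  awk -F'|' '{ printf "%s: spot is %.1f%% cheaper\n", $1, (1 - $2 / $3) * 100 }' <<'EOF'
Standard_D2s_v5|10.19|83.95
Standard_NC6|78.84|788.40
EOF
}

spot_savings
```

With these sample prices the saving comes to about 87.9% for `Standard_D2s_v5` and 90.0% for `Standard_NC6`. Actual prices vary by region and over time.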