This page provides an overview of the architecture, requirements and steps to install the TrueFoundry compute plane cluster in Azure.
The architecture of a TrueFoundry compute plane is as follows:
Access Policies Overview
Policy | Description |
---|---|
Access required for Azure container registry, storage account | An Azure container registry is used to store the Docker images for the platform. A storage account is used to store the model artifacts. |
Azure AD application with Reader and Monitoring Reader on AKS | The Reader and Monitoring Reader permissions on AKS are used to access the cluster autoscaler logs in Log Analytics and to read Azure node pools. The user should have access to create an Azure AD application. |
The common requirements to set up the compute plane in each of the scenarios are as follows:
Access to public.ecr.aws, quay.io, ghcr.io, tfy.jfrog.io, docker.io/natsio, nvcr.io and registry.k8s.io, so that we can download the Docker images for argocd, nats, gpu operator, argo rollouts, argo workflows, istio, keda, etc.
services.example.com/tfy/*, however, many frontend applications do not support this. For certificate, check this document for more details.
Critical CPU on-demand node pool (2vCPU, 8GB RAM, min 2 nodes)
We need to create one nodepool for running TrueFoundry critical workloads like prometheus, loki and tfy-agent. We should put the taint class.truefoundry.com/component=critical:NoSchedule and the label class.truefoundry.com/component=critical on this nodepool. The min-instance count should be 2 and the max-instance count should be 5.
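If you create this nodepool by hand rather than through the generated Terraform, the settings above map roughly onto an `az aks nodepool add` call. The sketch below only builds the argument list and does not run it. The resource group, cluster name, pool name and VM size are placeholders assumed for illustration, not values prescribed by TrueFoundry.

```python
import shlex

# Taint and label key taken from the requirement above.
CRITICAL_KEY = "class.truefoundry.com/component"


def critical_nodepool_command(resource_group, cluster_name,
                              pool_name="critical",
                              vm_size="Standard_D2ds_v5"):
    """Build an `az aks nodepool add` argument list for the critical pool.

    pool_name and vm_size are illustrative assumptions; choose a size that
    provides at least 2 vCPU and 8 GB RAM.
    """
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--node-vm-size", vm_size,
        "--enable-cluster-autoscaler",
        "--min-count", "2",
        "--max-count", "5",
        "--node-taints", f"{CRITICAL_KEY}=critical:NoSchedule",
        "--labels", f"{CRITICAL_KEY}=critical",
    ]


if __name__ == "__main__":
    # Hypothetical names; print the command so it can be reviewed first.
    cmd = critical_nodepool_command("my-resource-group", "my-aks-cluster")
    print(shlex.join(cmd))
```

Printing the command instead of executing it lets you review the taint and label before anything is created.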
CPU on-demand node pools
These nodepools are for running the user-deployed applications. We should create 2-3 on-demand CPU nodepools with varying configurations depending on the requirements of the workloads. The min-instance count can be configured to 0 and the max-instance count can be configured to 10. A few sample instance types that can be chosen: Standard_D4ds_v5, Standard_D8ds_v5, Standard_D16ds_v5. The choice depends on the expected usage of the cluster.
CPU spot node pools
These nodepools are for running the user-deployed applications. We should create 2-3 spot CPU nodepools with varying configurations depending on the requirements of the workloads. The min-instance count can be configured to 0 and the max-instance count can be configured to 10. A few sample instance types that can be chosen: Standard_D4ds_v5, Standard_D8ds_v5, Standard_D16ds_v5. The choice depends on the expected usage of the cluster.
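For reference, a spot nodepool differs from an on-demand one mainly in its priority settings. The following is a minimal sketch of the corresponding `az aks nodepool add` arguments, assuming manual creation. The function name and all resource names are hypothetical, and the eviction policy and max price shown are one possible choice rather than a TrueFoundry requirement.

```python
import shlex


def spot_nodepool_command(resource_group, cluster_name, pool_name, vm_size,
                          min_count=0, max_count=10):
    """Build an `az aks nodepool add` argument list for a spot pool."""
    if not 0 <= min_count <= max_count:
        raise ValueError("need 0 <= min_count <= max_count")
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--node-vm-size", vm_size,
        "--priority", "Spot",
        "--eviction-policy", "Delete",   # assumed choice for this sketch
        "--spot-max-price", "-1",        # -1: cap at the on-demand price
        "--enable-cluster-autoscaler",
        "--min-count", str(min_count),
        "--max-count", str(max_count),
    ]


if __name__ == "__main__":
    # Hypothetical names; the command is printed, not executed.
    cmd = spot_nodepool_command("my-resource-group", "my-aks-cluster",
                                "cpu4spot", "Standard_D4ds_v5")
    print(shlex.join(cmd))
```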
GPU on-demand node pools
These nodepools are for running the user-deployed applications. In case you are planning to use GPU instances, you can create 2-3 on-demand GPU nodepools with different types of GPUs, for example with the instance types Standard_NC4as_T4_v3, Standard_NC24ads_A100_v4, Standard_NV6ads_A10_v5. The min-instance count should be configured to 0 and the max-instance count can be configured to 10.
GPU spot node pools
These nodepools are for running the user-deployed applications. In case you are planning to use GPU instances, you can create 2-3 spot GPU nodepools with different types of GPUs, for example with the instance types Standard_NC4as_T4_v3, Standard_NC24ads_A100_v4, Standard_NV6ads_A10_v5. The min-instance count should be configured to 0 and the max-instance count can be configured to 10.
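The nodepool layout described above can be written down as data, which makes it easy to check the autoscaling bounds before provisioning. This is a minimal sketch: the `NodePool` class and the pool names are hypothetical, the critical pool's VM size is an assumption, and the other instance types are samples from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    name: str        # hypothetical pool name
    vm_size: str
    min_count: int
    max_count: int
    spot: bool = False
    gpu: bool = False

    def validate(self):
        """Reject autoscaling bounds that cannot be satisfied."""
        if not 0 <= self.min_count <= self.max_count:
            raise ValueError(f"{self.name}: need 0 <= min_count <= max_count")


PLAN = [
    NodePool("critical", "Standard_D2ds_v5", 2, 5),  # VM size assumed
    NodePool("cpu4", "Standard_D4ds_v5", 0, 10),
    NodePool("cpu8", "Standard_D8ds_v5", 0, 10),
    NodePool("cpu4spot", "Standard_D4ds_v5", 0, 10, spot=True),
    NodePool("cpu8spot", "Standard_D8ds_v5", 0, 10, spot=True),
    NodePool("gput4", "Standard_NC4as_T4_v3", 0, 10, gpu=True),
    NodePool("gput4spot", "Standard_NC4as_T4_v3", 0, 10, spot=True, gpu=True),
]


def min_nodes(plan):
    """Number of nodes that are always running."""
    return sum(pool.min_count for pool in plan)


def max_nodes(plan):
    """Upper bound on node count if every pool scales to its maximum."""
    for pool in plan:
        pool.validate()
    return sum(pool.max_count for pool in plan)
```

With this sample plan, `min_nodes(PLAN)` is 2 (only the critical pool is always on) and `max_nodes(PLAN)` is 65.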
TrueFoundry compute plane infrastructure is provisioned using Terraform. You can download the Terraform code for your exact account by filling in your account details and downloading a script that can be executed on your local machine.
Choose to create a new cluster or attach an existing cluster
Go to the platform section in the left panel and click on Clusters. You can click on Create New Cluster or Attach Existing Cluster depending on your use case. Read the requirements and, if everything is satisfied, click on Continue.
Fill in the form to generate the Terraform code
A form will be presented with the details for the new cluster to be created. Fill in your cluster details and click Submit when done.
The key fields to fill in here are:
Region - The region and availability zones where you want to create the cluster.
Resource Group - The resource group where you want to create the cluster. Choose between New Resource Group or Existing Resource Group depending on your use case.
Cluster Name - A name for your cluster.
Cluster Version and node pools - The version of the cluster and the node pools to be created.
Network Configuration - Choose between New Vnet or Existing Vnet depending on your use case.
Storage account (container) for Terraform State - Terraform state will be stored in this container. It can be a preexisting storage account or a new storage account name. A new storage account will automatically be created by our script.
Platform Features - This decides which features, like BlobStorage, ClusterIntegration using Azure AD and Container Registry, will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
The key fields to fill in here are:
Region - The region and availability zones where you want to create the cluster.
Resource Group - The resource group where the cluster is already created.
Cluster Name - Your cluster name.
Network Configuration - Existing Vnet and subnet details.
Cluster Addons - TrueFoundry needs to install addons like ArgoCD, ArgoWorkflows, Keda, Istio, etc. Please disable the addons that are already installed on your cluster so that the TrueFoundry installation does not override the existing configuration and affect your existing workloads.
Storage account (container) for Terraform State - Terraform state will be stored in this container. It can be a preexisting storage account or a new storage account name. A new storage account will automatically be created by our script.
Platform Features - This decides which features, like BlobStorage, ClusterIntegration using Azure AD and Container Registry, will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
Copy the curl command and execute it on your local machine
You will be presented with a curl command to download and execute the script. The script will take care of installing the prerequisites, downloading the Terraform code and running it on your local machine to create the cluster. This will take around 40-50 minutes to complete.
Verify the cluster is showing as connected in the platform
Once the script is executed, the cluster will be shown as connected in the platform.
Create DNS Record
We can get the load balancer's IP address by going to the platform section in the bottom left panel under the Clusters section. Under the preferred cluster, you'll see the load balancer IP address under the Base Domain URL section.
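Create a DNS record in Route 53 or your DNS provider with the following details. Because the record value is an IP address, the record must be an A record; a CNAME record can only point to another hostname.
Record Type | Record Name | Record Value |
---|---|---|
A | *.tfy.example.com | LOADBALANCER_IP_ADDRESS |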
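Once the record has propagated, you can confirm that a hostname under the wildcard resolves to the load balancer. This is a minimal sketch using only the Python standard library. The function name and the `resolver` parameter are hypothetical (the latter exists so the check can be exercised without network access), and the hostname and IP address in the usage note are placeholders.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if hostname resolves to expected_ip (IPv4 only).

    resolver must behave like socket.gethostbyname_ex and return a
    (hostname, aliases, ip_addresses) tuple.
    """
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror: the name did not resolve.
        return False
    return expected_ip in addresses
```

For example, `resolves_to("app.tfy.example.com", "203.0.113.10")` should return True after you substitute your own domain and the value of LOADBALANCER_IP_ADDRESS.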
Set up routing and TLS for deploying workloads to your cluster
Follow the instructions here to set up DNS and TLS for deploying workloads to your cluster.
Start deploying workloads to your cluster
You can start by going here
The IAM user should have the following permissions:
Contributor role on the above subscription
Role Based Access Control Administrator role on the above subscription
Either Azure AD Administrator or Azure AD Application Developer role to:
There are primarily two ways through which we can add TLS to the load balancer in Azure