Azure
Provisioning Control Plane Infrastructure on Azure
Some steps in this guide require the Truefoundry team to be involved. Please reach out to [email protected] to get the credentials.
Setting up Truefoundry control plane on your own cloud involves creating the infrastructure to support the platform and then installing the platform itself.
Setting up Infrastructure
Requirements
These are the infrastructure components required to set up a production grade Truefoundry control plane.
If you already have the requirements below set up, skip directly to the Installing TrueFoundry section.
Requirements | Description | Reason for Requirement |
---|---|---|
Kubernetes Cluster | Any Kubernetes cluster will work here - we can also choose the compute-plane cluster itself to install Truefoundry helm chart. | The Truefoundry helm chart will be installed here. |
Azure Flexible Server for PostgreSQL | Postgres >= 13 | The database is used by Truefoundry control plane to store all its metadata. |
Container in Azure Storage Account | Any container bucket reachable from control-plane. | This is used by control-plane to store the intermediate code while building the docker image. |
AzureAD application | AzureAD application for a service principal having read only access to the AKS cluster | This is used to read the node pools created in the AKS cluster for workloads to get deployed on them. |
Egress Access for TruefoundryAuth | Egress access to https://auth.truefoundry.com | This is needed to validate the users logging into Truefoundry so that licensing can be maintained. |
Egress access for Docker Registry | 1. public.ecr.aws 2. quay.io 3. ghcr.io 4. docker.io/truefoundrycloud 5. docker.io/natsio 6. nvcr.io 7. registry.k8s.io | This is to download docker images for Truefoundry, ArgoCD, NATS, ArgoRollouts, ArgoWorkflows, Istio. |
DNS with TLS/SSL | One endpoint to point to the control plane service (something like platform.example.com, where example.com is your domain). There should also be a certificate for the domain so that it can be accessed over TLS. The control-plane URL should be reachable from the compute-plane so that the compute-plane cluster can connect to the control-plane. Ensure that require_secure_transport (a PostgreSQL server parameter) is kept OFF. | The developers will need to access the Truefoundry UI at the domain that is provided here. |
User/ServiceAccount to provision the infrastructure | 1. Azure subscription with billing enabled 2. Contributor role on the above subscription 3. Role Based Access Control Administrator role on the above subscription | These are the permissions required by the IAM user in Azure to create the entire control plane components. |
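Before provisioning, it can help to confirm that the egress rules in the table are actually in place. The following is a minimal sketch, assuming `curl` is available on a machine that shares the cluster's egress path. The names `EGRESS_HOSTS`, `probe_host` and `check_egress` are illustrative and not part of any Truefoundry tooling. Any HTTP response counts as reachable, since the goal is only to test connectivity.

```shell
# Hosts from the requirements table above (registry paths such as
# docker.io/truefoundrycloud are reduced to their hostnames).
EGRESS_HOSTS="auth.truefoundry.com public.ecr.aws quay.io ghcr.io docker.io nvcr.io registry.k8s.io"

# probe_host HOST: print "ok HOST" if an HTTPS request to HOST gets any
# response, "FAIL HOST" if the request fails (5-second connect timeout).
probe_host() {
  if curl -s -o /dev/null --connect-timeout 5 "https://$1"; then
    echo "ok $1"
  else
    echo "FAIL $1"
  fi
}

# check_egress: probe every host in EGRESS_HOSTS, one result per line.
check_egress() {
  for host in $EGRESS_HOSTS; do
    probe_host "$host"
  done
}

# Usage: check_egress
```

A `FAIL` line only says that the host could not be reached from where the script ran, so it is most useful when run from inside the network that will host the cluster.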
Run Infra Provisioning using OCLI
Prerequisites
- Install git if not already present.
- Setup az CLI
  - Install azure cli >= 2.50
  - Log in and set a subscription. Please ensure that the user has the Contributor and RBAC admin roles in the subscription.
# login
az login
# setting the subscription
az account set --subscription $SUBSCRIPTION_ID
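The version and role requirements above can be checked from the command line. This is a sketch under a few assumptions: the login is an interactive user (not a service principal), `SUBSCRIPTION_ID` is already exported, and the helper names `version_ge` and `check_az_prereqs` are made up for this example. It prints the role names assigned to the signed-in user at the subscription scope so they can be compared by eye with the required roles.

```shell
# version_ge A B: succeed when version A >= B, comparing major then minor.
# Both arguments are expected to look like MAJOR.MINOR or MAJOR.MINOR.PATCH.
version_ge() {
  a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
  b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
  [ "$a_major" -gt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# check_az_prereqs: verify the CLI version, then list the roles assigned to
# the signed-in user at the subscription scope.
# Assumes `az login` has been run and SUBSCRIPTION_ID is set.
check_az_prereqs() {
  installed=$(az version --query '"azure-cli"' -o tsv)
  if ! version_ge "$installed" 2.50; then
    echo "azure cli $installed is older than 2.50" >&2
    return 1
  fi
  user_id=$(az ad signed-in-user show --query id -o tsv)
  az role assignment list --assignee "$user_id" \
    --scope "/subscriptions/$SUBSCRIPTION_ID" \
    --query "[].roleDefinitionName" -o tsv
}

# Usage: check_az_prereqs
```

Roles granted through group membership or inherited from a management group may not appear in this listing, so an empty result is a prompt to look closer rather than proof that the roles are missing.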
Installing OCLI
- Download the binary for your platform using one of the commands below.
# macOS, arm64 (Apple Silicon)
curl -H 'Cache-Control: max-age=0' -s "https://releases.ocli.truefoundry.tech/binaries/ocli_$(curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/stable.txt)_darwin_arm64" -o ocli
# macOS, amd64 (Intel)
curl -H 'Cache-Control: max-age=0' -s "https://releases.ocli.truefoundry.tech/binaries/ocli_$(curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/stable.txt)_darwin_amd64" -o ocli
# Linux, arm64
curl -H 'Cache-Control: max-age=0' -s "https://releases.ocli.truefoundry.tech/binaries/ocli_$(curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/stable.txt)_linux_arm64" -o ocli
# Linux, amd64
curl -H 'Cache-Control: max-age=0' -s "https://releases.ocli.truefoundry.tech/binaries/ocli_$(curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/stable.txt)_linux_amd64" -o ocli
- Make the binary executable and move it to a directory on your $PATH
sudo chmod +x ./ocli
sudo mv ocli /usr/local/bin
- Confirm by running the command
ocli --version
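The four download URLs differ only in their platform suffix, so the choice can be scripted. The sketch below is one possible way to do it. It reuses the release URLs shown above, while the function names `ocli_platform` and `download_ocli` are hypothetical helpers introduced for this example.

```shell
# ocli_platform OS ARCH: map `uname -s` / `uname -m` style values to the
# suffix used in the OCLI binary names (for example darwin_arm64).
ocli_platform() {
  os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$2" in
    x86_64|amd64) arch=amd64 ;;
    arm64|aarch64) arch=arm64 ;;
    *) echo "unsupported architecture: $2" >&2; return 1 ;;
  esac
  case "$os" in
    darwin|linux) echo "${os}_${arch}" ;;
    *) echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# download_ocli: fetch the stable OCLI binary for the current machine
# into ./ocli, using the same URLs as the manual commands above.
download_ocli() {
  base=https://releases.ocli.truefoundry.tech
  version=$(curl -H 'Cache-Control: max-age=0' -s "$base/stable.txt")
  platform=$(ocli_platform "$(uname -s)" "$(uname -m)") || return 1
  curl -H 'Cache-Control: max-age=0' -s \
    "$base/binaries/ocli_${version}_${platform}" -o ocli
}

# Usage: download_ocli
```

After `download_ocli` finishes, the chmod and move steps above still apply.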
Configuring input config file
- To create a new cluster, you need your Azure Subscription, Location, and Resource Group.
- Run the following command to fill in the inputs interactively
ocli infra-init
- For networking, there are the following possible configurations:
- New resource group & network (Recommended) - This will create a new resource group and a new Virtual network.
- Existing resource group with existing network - You can use an existing resource group and an existing Virtual network.
- Existing resource group with new network - You can use an existing resource group while creating a new Virtual network
- Once all the inputs are filled, an input config file named tfy-config.yaml is generated in your current directory.
- Modify the file to enable control plane installation by setting azure.tfy_control_plane.enabled: true. Also modify azure.tfy_control_plane.subnet_cidr: "" or azure.tfy_control_plane.subnet_id: "" for installing the control plane components. Below are sample config files:
# Sample: existing resource group with existing network
aws: null
azure:
  cluster:
    name: CLUSTER_NAME
    node_pools:
      cpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D2ds_v5
          max_count: 2
          name: cpu
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D4ds_v5
          max_count: 2
          name: cpu2x
      gpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NV6ads_A10_v5
          max_count: 2
          name: a10
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC4as_T4_v3
          max_count: 2
          name: t4
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC24ads_A100_v4
          max_count: 2
          name: a100
      initial:
        instance_type: Standard_D2ds_v5
        name: initial
    version: "1.29"
  location: eastus
  network:
    existing: true
    subnet_cidr: ""
    subnet_id: "/subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/VNET/subnets/SUBNET"
    vnet_cidr: ""
    vnet_id: "/subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/VNET"
    vnet_name: ""
  platform_features:
    blob_storage:
      container_enable_override: false
      container_override_name: ""
      enabled: true
      storage_account_enable_override: false
      storage_account_override_name: ""
    cloud_integration:
      azuread_application_enable_override: false
      azuread_application_override_name: ""
      enabled: true
    container_registry:
      container_registry_enable_override: false
      container_registry_override_name: ""
      enabled: true
    enabled: true
  resource_group:
    existing: true
    name: RESOURCE_GROUP
  state:
    container_name: tfy-tfstate-CLUSTER_NAME-cn-1714629250
    resource_group: tfy-tfstate-CLUSTER_NAME-rg-1714629250
    storage_account_name: tfytfstateCLUSTER_NAMEsa
    storage_account_sku: Standard_GRS
  subscription:
    id: SUBSCRIPTION_ID
    name: SUBSCRIPTION_NAME
  tags: {}
  tfy_control_plane:
    database:
      existing_network: true
      instance_class: GP_Standard_D4ds_v5
      subnet_cidr: ""
      subnet_id: "/subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/VNET/subnets/DB_SUBNET"
    enabled: true
binaries:
  terraform:
    binary_path: null
  terragrunt:
    binary_path: null
gcp: null
provider: azure
# Sample: existing resource group with new network
aws: null
azure:
  cluster:
    name: CLUSTER_NAME
    node_pools:
      cpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D2ds_v5
          max_count: 2
          name: cpu
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D4ds_v5
          max_count: 2
          name: cpu2x
      gpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NV6ads_A10_v5
          max_count: 2
          name: a10
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC4as_T4_v3
          max_count: 2
          name: t4
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC24ads_A100_v4
          max_count: 2
          name: a100
      initial:
        instance_type: Standard_D2ds_v5
        name: initial
    version: "1.29"
  location: eastus
  network:
    existing: false
    subnet_cidr: 10.10.0.0/16
    subnet_id: ""
    vnet_cidr: 10.0.0.0/8
    vnet_id: ""
    vnet_name: ""
  platform_features:
    blob_storage:
      container_enable_override: false
      container_override_name: ""
      enabled: true
      storage_account_enable_override: false
      storage_account_override_name: ""
    cloud_integration:
      azuread_application_enable_override: false
      azuread_application_override_name: ""
      enabled: true
    container_registry:
      container_registry_enable_override: false
      container_registry_override_name: ""
      enabled: true
    enabled: true
  resource_group:
    existing: true
    name: RESOURCE_GROUP
  state:
    container_name: tfy-tfstate-CLUSTER_NAME-cn-1714629250
    resource_group: tfy-tfstate-CLUSTER_NAME-rg-1714629250
    storage_account_name: tfytfstateCLUSTER_NAMEsa
  subscription:
    id: SUBSCRIPTION_ID
    name: SUBSCRIPTION_NAME
  tags: {}
  tfy_control_plane:
    database:
      existing_network: true
      instance_class: GP_Standard_D4ds_v5
      subnet_cidr: "10.11.0.0/24"
      subnet_id: "/subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/RESOURCE_GROUP/providers/Microsoft.Network/virtualNetworks/VNET/subnets/DB_SUBNET"
    enabled: true
binaries:
  terraform:
    binary_path: null
  terragrunt:
    binary_path: null
gcp: null
provider: azure
# Sample: new resource group and new network (control plane not enabled)
aws: null
azure:
  cluster:
    name: CLUSTER_NAME
    node_pools:
      cpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D2ds_v5
          max_count: 2
          name: cpu
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_D4ds_v5
          max_count: 2
          name: cpu2x
      gpu_pools:
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NV6ads_A10_v5
          max_count: 2
          name: a10
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC4as_T4_v3
          max_count: 2
          name: t4
        - enable_on_demand_pool: true
          enable_spot_pool: true
          instance_type: Standard_NC24ads_A100_v4
          max_count: 2
          name: a100
      initial:
        instance_type: Standard_D2ds_v5
        name: initial
    version: "1.29"
  location: eastus
  network:
    existing: false
    subnet_cidr: 10.10.0.0/16
    subnet_id: ""
    vnet_cidr: 10.0.0.0/8
    vnet_id: ""
    vnet_name: ""
  platform_features:
    blob_storage:
      container_enable_override: false
      container_override_name: ""
      enabled: true
      storage_account_enable_override: false
      storage_account_override_name: ""
    cloud_integration:
      azuread_application_enable_override: false
      azuread_application_override_name: ""
      enabled: true
    container_registry:
      container_registry_enable_override: false
      container_registry_override_name: ""
      enabled: true
    enabled: true
  resource_group:
    existing: false
    name: RESOURCE_GROUP
  state:
    container_name: tfy-tfstate-CLUSTER_NAME-cn-1714629250
    resource_group: tfy-tfstate-CLUSTER_NAME-rg-1714629250
    storage_account_name: tfytfstateCLUSTER_NAMEsa
  subscription:
    id: SUBSCRIPTION_ID
    name: SUBSCRIPTION_NAME
  tags: {}
  tfy_control_plane:
    enabled: false
binaries:
  terraform:
    binary_path: null
  terragrunt:
    binary_path: null
gcp: null
provider: azure
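When a new network is created, the CIDR values in the config have to be consistent: each subnet must lie inside `vnet_cidr`, and subnets in the same virtual network must not overlap. The sketch below checks this for IPv4 ranges such as the sample values 10.0.0.0/8, 10.10.0.0/16 and 10.11.0.0/24. It assumes the cluster subnet and the database subnet share one virtual network, and the function names are illustrative only.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_contains OUTER INNER: succeed when INNER lies entirely inside OUTER.
cidr_contains() {
  outer_len=${1#*/}
  inner_len=${2#*/}
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - $outer_len)) & 0xFFFFFFFF ))
  outer_net=$(( $(ip_to_int "${1%/*}") & $mask ))
  inner_net=$(( $(ip_to_int "${2%/*}") & $mask ))
  [ "$outer_net" -eq "$inner_net" ]
}

# cidrs_overlap A B: two CIDR blocks overlap exactly when one contains the other.
cidrs_overlap() {
  cidr_contains "$1" "$2" || cidr_contains "$2" "$1"
}

# Usage with the sample values:
#   cidr_contains 10.0.0.0/8 10.10.0.0/16 && echo "cluster subnet is inside the vnet"
#   cidrs_overlap 10.10.0.0/16 10.11.0.0/24 || echo "subnets do not overlap"
```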
Create the cluster
Run the following command to create the AKS cluster and the IAM roles needed to provide access to the various infrastructure components, as per the inputs configured above.
ocli infra-create --file tfy-config.yaml
This command may take around 30-45 minutes to complete.
In the last step the database credentials will be printed. Make sure to note them down.
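Once provisioning completes, a quick way to confirm that the cluster is usable is to fetch its credentials and list the nodes. This is a minimal sketch using the standard `az aks get-credentials` and `kubectl get nodes` commands. Whether OCLI already configures kubeconfig for you is not stated in this guide, so treat this step as an optional check. `connect_to_cluster` is a hypothetical helper name.

```shell
# connect_to_cluster RESOURCE_GROUP CLUSTER_NAME: merge the AKS credentials
# into the local kubeconfig, then list the nodes to confirm the cluster
# answers API requests.
connect_to_cluster() {
  az aks get-credentials --resource-group "$1" --name "$2" --overwrite-existing &&
    kubectl get nodes
}

# Usage (values from tfy-config.yaml):
#   connect_to_cluster RESOURCE_GROUP CLUSTER_NAME
```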
Installing TrueFoundry
Pre-requisites
- Install helm.
- Add the following chart repositories
helm repo add argocd https://argoproj.github.io/argo-helm
helm repo add truefoundry https://truefoundry.github.io/infra-charts/
- Update the helm repos to download the latest repository index
helm repo update argocd truefoundry
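After updating the repositories, it is possible to confirm that the local index actually contains the chart version that the install step pins (6.7.10 for argo-cd). The helper below is a sketch built on `helm search repo`, and `chart_available` is a name invented for this example.

```shell
# chart_available CHART VERSION: succeed when the local helm index lists
# CHART at exactly VERSION. Parses the default table output of
# `helm search repo`, where column 1 is the name and column 2 the chart version.
chart_available() {
  helm search repo "$1" --version "$2" |
    awk -v chart="$1" -v version="$2" \
      '$1 == chart && $2 == version { found = 1 } END { exit !found }'
}

# Usage:
#   chart_available argocd/argo-cd 6.7.10 || echo "run: helm repo update argocd"
```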
Installing truefoundry helm chart
- Install the argocd helm chart
helm upgrade --install argocd argocd/argo-cd -n argocd \
  --create-namespace \
  --version 6.7.10 \
  --set applicationSet.enabled=false \
  --set notifications.enabled=false \
  --set dex.enabled=false
- Create values.yaml for the truefoundry helm chart. You can refer to the chart's values for more details.
## @param tenantName Parameters for tenantName
## Tenant Name - This is same as the name of the organization used to sign up
## on Truefoundry
##
tenantName: ""

## @param controlPlaneURL Parameters for controlPlaneURL
## URL of the control plane - This is the URL that can be used by workload to access the truefoundry components
##
controlPlaneURL: ""

## @param clusterName Name of the cluster
## Name of the cluster that you have created on AWS/GCP/Azure
##
clusterName: ""

## @section notebookController parameters
## Notebook Controller is required to power notebooks in Truefoundry
##
notebookController:
  enabled: false
  defaultStorageClass: ""

## @section tfyAgent parameters
tfyAgent:
  enabled: false

## @param truefoundry.enabled Flag to enable TrueFoundry
## This installs the Truefoundry control plane helm chart. You can make it true
## if you want to install Truefoundry control plane.
##
truefoundry:
  enabled: true
  ## @param truefoundry.devMode.enabled Flag to enable TrueFoundry Dev mode. Postgres will run
  ##
  devMode:
    enabled: false
  truefoundryBootstrap:
    enabled: true
  truefoundryFrontendApp:
    replicaCount: 2
    istio:
      virtualservice:
        enabled: true
        gateways: ["istio-system/tfy-wildcard"]
        hosts:
          - ""
  tfyWorkflowAdmin:
    enabled: false
  database:
    host: ""
    name: ""
    username: ""
    password: ""
  ## @param global.tfyApiKey API key for truefoundry
  ##
  tfyApiKey: ""
  ## @param global.truefoundryImagePullConfigJSON JSON config for image pull secret
  ##
  truefoundryImagePullConfigJSON: ""
- Fill the following values
  - tenantName - name of the tenant. If you haven't created one yet, please create it first.
  - controlPlaneURL - URL at which to host the platform (e.g. https://truefoundry.example.com)
  - clusterName - name of the cluster
- For the remaining values
  - truefoundry.tfyApiKey - API key to be given by the TrueFoundry team
  - truefoundry.truefoundryImagePullConfigJSON - image pull config JSON to be given by the TrueFoundry team
  - truefoundry.truefoundryFrontendApp.istio.virtualservice.hosts[0] - control plane URL without the protocol
- Run the following command to install the chart
helm upgrade --install tfy-k8s-azure-aks-inframold \
  truefoundry/tfy-k8s-azure-aks-inframold \
  -f values.yaml -n argocd
- Once the helm chart is installed, point the control plane URL to the load balancer's IP address. To get the IP address of the load balancer, run
kubectl get svc tfy-istio-ingress -n istio-system
- The TLS certificates also need to be passed to the load balancer (in our case istio) so that it can terminate the TLS traffic.
- Log in at the control plane URL with the same credentials used to register the tenant.
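Two of the steps above lend themselves to a quick scripted check: deriving the host value (the control plane URL without the protocol) and confirming that the DNS record points at the load balancer. The following sketch assumes `dig` is installed and that the ingress service exposes an IP rather than a hostname. The function names are hypothetical.

```shell
# strip_protocol URL: print the host part of a URL such as https://host/path.
strip_protocol() {
  host=${1#*://}
  echo "${host%%/*}"
}

# lb_ip: print the external IP of the istio ingress service.
lb_ip() {
  kubectl get svc tfy-istio-ingress -n istio-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# dns_matches_lb URL: succeed when the URL's host resolves to the
# load balancer IP; fail if the IP is not assigned yet or DNS differs.
dns_matches_lb() {
  expected=$(lb_ip)
  [ -n "$expected" ] || return 1
  dig +short "$(strip_protocol "$1")" | grep -qxF "$expected"
}

# Usage:
#   strip_protocol https://truefoundry.example.com   # value for hosts[0]
#   dns_matches_lb https://truefoundry.example.com && echo "DNS is in place"
```

DNS changes can take a while to propagate, so a failure right after updating the record does not necessarily mean the record is wrong.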
Add the compute plane
- Add the same cluster as the compute-plane from the UI and get the cluster token
- Add the token in the values.yaml
## @section tfyAgent parameters
tfyAgent:
  enabled: true
  clusterToken: ""
- The control plane URL should be reachable from inside the k8s cluster, as the tfy-agent uses the control plane URL to initiate the connection to the control plane.
- Run the helm command again to apply the updated values
helm upgrade --install tfy-k8s-azure-aks-inframold \
  truefoundry/tfy-k8s-azure-aks-inframold \
  -f values.yaml -n argocd
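Since the tfy-agent connects to the control plane from inside the cluster, reachability is best tested from a pod rather than from a laptop. The sketch below starts a throwaway pod and requests the URL from there. The `curlimages/curl` image and the pod name are assumptions made for this example (the image is not in the egress list above, so substitute any image with `curl` that your cluster is allowed to pull).

```shell
# check_from_cluster URL: run a temporary pod, request URL from inside the
# cluster, and print the HTTP status code. --rm deletes the pod afterwards.
check_from_cluster() {
  kubectl run tfy-connectivity-check --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -sS -o /dev/null -w '%{http_code}\n' "$1"
}

# Usage:
#   check_from_cluster https://truefoundry.example.com
```

Any HTTP status code in the output means the control plane URL resolved and answered from inside the cluster, while a curl error points at DNS, routing, or TLS.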
Adding domain to Load balancer
- We need to add one more domain to the load balancer so that a separate domain can be used to host the workloads only. This domain can be a wildcard (recommended) as well.
- To add the domain
- Point the domain to the load balancer IP address.
- Pass the TLS certificate to istio so that it can terminate the TLS traffic.
- Add the domain in the platform.
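One detail worth keeping in mind when choosing a wildcard domain: a wildcard in a TLS certificate covers exactly one label, so `*.apps.example.com` matches `svc.apps.example.com` but neither `a.b.apps.example.com` nor `apps.example.com` itself. The helper below illustrates that rule. The domain names are placeholders and `wildcard_covers` is a name made up for this sketch.

```shell
# wildcard_covers PATTERN HOST: succeed when HOST is covered by a wildcard
# PATTERN of the form *.domain, i.e. HOST is exactly one non-empty label
# followed by the pattern's domain.
wildcard_covers() {
  suffix=${1#\*.}      # "*.apps.example.com" -> "apps.example.com"
  label=${2%%.*}       # first label of the host
  [ -n "$label" ] && [ "$2" = "$label.$suffix" ]
}

# Usage:
#   wildcard_covers '*.apps.example.com' 'svc.apps.example.com' && echo covered
```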
Adding integrations
- If you have used ocli to bootstrap your infrastructure, it creates the following additional resources along with the AKS cluster in your selected resource group. Check the documents below to understand how to create the integrations manually if they were not created through OCLI, and how to add them to the platform.
  - Container registry - How to add container registry to the platform
  - Storage account - How to add storage account to the platform
    - Container
  - Service Principal having read only access to AKS cluster - How to add azure application to TF platform
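To find the names of the resources that were created, the resource group can be listed with the Azure CLI. This is a small sketch using `az resource list`, and `list_created_resources` is a hypothetical wrapper. Note that the AzureAD application is a directory object rather than a resource-group resource, so it will not appear in this listing.

```shell
# list_created_resources RESOURCE_GROUP: print the name and type of every
# resource in the group, so the container registry, storage account, and
# AKS cluster can be located.
list_created_resources() {
  az resource list --resource-group "$1" \
    --query "[].{name:name, type:type}" -o table
}

# Usage:
#   list_created_resources RESOURCE_GROUP
```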