Creating an AKS cluster using azure-cli

Creating a Kubernetes cluster on Azure

Using Azure CLI

1. Install Azure CLI

You can find installation instructions for the Azure CLI for your preferred workstation here - https://learn.microsoft.com/en-us/cli/azure/install-azure-cli. For example, on macOS:

brew update && brew install azure-cli

# confirm the CLI version
az version
{
  "azure-cli": "2.50.0",
  "azure-cli-core": "2.50.0",
  "azure-cli-telemetry": "1.0.8",
  "extensions": {}
}

2. Log In to Azure

You can use any of several methods to log in through the Azure CLI

# browser based login
az login

# with username and password
az login --username <user> --password <pass>

# log in to a specific tenant
az login --tenant <tenantID>

# check Azure login help for other methods to log in
az login --help

3. Get the necessary details for the next steps

  • Subscription details - Check that the currently selected subscription is the one you want
    az account show
    
  • Region - Check the region where you want to deploy your k8s cluster, and make sure the GPU instances you need are available there. You can check the types of GPU instances and the availability of GPU instances in your specific region (see the SKU check right after this list). To get a list of all regions:
    az account list-locations -o table
    
    • Make sure to use the Name column (the second column) of the above output as the region name.
  • Resource group - A new resource group is recommended, although an existing one can also be used.
    • To create a new resource group with tags, execute the command in step 5.
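
For example, to check which GPU-capable VM SKUs (names starting with Standard_N) are available in a region, you can list them with az vm list-skus; eastus below is just a placeholder region:

# list GPU-capable VM sizes available in the example region eastus
az vm list-skus --location eastus --size Standard_N --resource-type virtualMachines --output table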

4. Export all the necessary details into environment variables


# subscription details
export SUBSCRIPTION_ID=""
export SUBSCRIPTION_NAME=""

# location
export LOCATION=""

# resource group
export RESOURCE_GROUP=""

# name of the user assigned identity (step 6)
export USER_ASSIGNED_IDENTITY=""

# name of the cluster
export CLUSTER_NAME=""

Set the subscription for all the steps below and add the aks-preview extension (required for some of the preview arguments used later).

az account set --subscription $SUBSCRIPTION_ID
az extension add --name aks-preview
az extension update --name aks-preview

5. Creating a resource group

Most Azure resources are deployed into a resource group, so we will create one for our AKS cluster. We are naming it tfy-datascience, but feel free to follow your own naming conventions. We also add two tags, team=datascience and owner=truefoundry.

RESOURCE_GROUP_ID=$(az group create --name $RESOURCE_GROUP \
--location $LOCATION \
--tags team=datascience owner=truefoundry \
--query 'id' --output tsv)
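
Optionally, confirm the resource group and the captured ID as a quick sanity check:

echo $RESOURCE_GROUP_ID
az group show --name $RESOURCE_GROUP --output table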

6. Create a user-assigned identity

To authenticate to the AKS cluster after creation, we need to create a user-assigned identity. A managed identity is the way to authenticate to an Azure resource (AKS here) using Azure AD. There are two kinds of managed identities, system-assigned and user-assigned, and we will use the latter. The command below captures the unique ID of the user-assigned identity for the later steps.

USER_ASSIGNED_IDENTITY_ID=$(az identity create \
--resource-group $RESOURCE_GROUP \
--name $USER_ASSIGNED_IDENTITY \
--query 'id' --output tsv)
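
Optionally, verify the identity and the captured ID:

echo $USER_ASSIGNED_IDENTITY_ID
az identity show --resource-group $RESOURCE_GROUP --name $USER_ASSIGNED_IDENTITY --output table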

7. Creating the AKS cluster

There are broadly two ways to create the AKS cluster. You can choose either of the following.

A. Creating an AKS cluster without specifying network requirements

In this approach we skip the network configuration during AKS creation, as it is handled automatically by Azure. We are using tfy-aks-cluster as the cluster name, and the node pool will autoscale between 1 and 2 nodes. You need to pass the user-assigned identity through the --assign-identity argument.

az aks create \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--enable-workload-identity \
--enable-managed-identity \
--assign-identity $USER_ASSIGNED_IDENTITY_ID \
--enable-oidc-issuer \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--kubernetes-version 1.26 \
--location $LOCATION \
--min-count 1 \
--max-count 2 \
--node-count 1 \
--network-plugin kubenet \
--node-vm-size Standard_D2s_v5 \
--node-osdisk-size 100 \
--nodepool-labels class.truefoundry.io=initial \
--nodepool-taints CriticalAddonsOnly=true:NoSchedule \
--enable-node-restriction \
--generate-ssh-keys \
--tags team=datascience owner=truefoundry

Get the kubeconfig file for the AKS cluster

az aks get-credentials --resource-group $RESOURCE_GROUP  --name $CLUSTER_NAME
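
Assuming kubectl is installed locally, you can verify access by listing the nodes:

kubectl get nodes -o wide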

B. Creating an AKS cluster with specific network requirements

  • Exporting the address prefixes and the vnet name
    • Export the variables below
      export VNET_NAME=""
      export VNET_ADDRESS_PREFIX=""
      export SUBNET_ADDRESS_PREFIX=""
      
    • You can use this as an example for the default prefixes
      export VNET_NAME="tfy-virtual-net"
      export VNET_ADDRESS_PREFIX="192.168.0.0/16"
      export SUBNET_ADDRESS_PREFIX="192.168.1.0/24"
      
  • Creating a virtual network tfy-virtual-net. The command below captures the unique ID of the subnet created inside the virtual network; make sure to keep it handy. All the nodes will be part of this virtual network.
    VNET_SUBNET_ID=$(az network vnet create \
    --resource-group $RESOURCE_GROUP \
    --name $VNET_NAME \
    --address-prefix $VNET_ADDRESS_PREFIX \
    --location $LOCATION \
    --subnet-name $VNET_NAME-subnet \
    --subnet-prefixes $SUBNET_ADDRESS_PREFIX \
    --query 'newVNet.subnets[0].id' --output tsv)
    
  • Create an AKS cluster tfy-aks-cluster-with-vnet in the above network. We are using the user-assigned identity we created above along with the subnet ID captured from the virtual network. The node pool will again autoscale between 1 and 2 nodes.
    az aks create \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP \
    --enable-workload-identity \
    --enable-managed-identity \
    --assign-identity $USER_ASSIGNED_IDENTITY_ID \
    --network-plugin kubenet \
    --enable-oidc-issuer \
    --enable-cluster-autoscaler \
    --enable-encryption-at-host \
    --kubernetes-version 1.26 \
    --location $LOCATION \
    --min-count 1 \
    --max-count 2 \
    --node-count 1 \
    --node-vm-size Standard_D2s_v5 \
    --node-osdisk-size 100 \
    --nodepool-labels class.truefoundry.io=initial \
    --nodepool-taints CriticalAddonsOnly=true:NoSchedule \
    --enable-node-restriction \
    --vnet-subnet-id $VNET_SUBNET_ID \
    --service-cidr 10.0.0.0/16 \
    --dns-service-ip 10.0.0.10 \
    --pod-cidr 10.244.0.0/16 \
    --docker-bridge-address 172.17.0.1/16 \
    --generate-ssh-keys \
    --tags team=datascience owner=truefoundry
    
  • Getting the kubeconfig file for the AKS cluster
    az aks get-credentials --resource-group $RESOURCE_GROUP  --name $CLUSTER_NAME
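    
  • Optionally, confirm that the network settings were applied by inspecting the cluster's network profile:
    az aks show \
    --resource-group $RESOURCE_GROUP \
    --name $CLUSTER_NAME \
    --query 'networkProfile' --output json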
    

❗️

az aks create command error: unrecognized arguments: --enable-node-restriction, --enable-workload-identity

If you face this error while creating the AKS cluster, it is most likely because the aks-preview extension is missing or out of date. Install and update it as shown in step 4, then re-run the command.

Attaching a user-based spot node pool

It is advised to attach a user spot node pool in AKS for workloads that can handle interruptions. There are two kinds of node pools in Azure: system and user. System node pools run AKS system applications and workloads, while user node pools run only your own workloads. Each of these node pools can contain instances of type on-demand or spot.
The first node pool we created was of type on-demand; now we will create one of type spot. Pods that should run on the spot pool need a matching toleration, as shown after the command below.

az aks nodepool add \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --name spotnodepool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --enable-cluster-autoscaler \
    --enable-encryption-at-host \
    --node-vm-size Standard_D4s_v5 \
    --min-count 1 \
    --node-count 1 \
    --max-count 10 \
    --mode User \
    --node-osdisk-size 100 \
    --tags team=datascience owner=truefoundry \
    --no-wait
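
AKS automatically adds the taint kubernetes.azure.com/scalesetpriority=spot:NoSchedule (and a matching node label) to spot node pools, so interruptible workloads need a toleration to be scheduled there. A minimal pod sketch, where the pod name and image are placeholders:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: spot-tolerant-pod            # placeholder name
spec:
  nodeSelector:
    kubernetes.azure.com/scalesetpriority: spot
  tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  containers:
    - name: app
      image: nginx                   # placeholder image
EOF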

Update autoscaling configurations

Run the below command to update the default cluster-autoscaler configurations

az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --cluster-autoscaler-profile \
    expander=random \
    scan-interval=30s \
    max-graceful-termination-sec=180 \
    max-node-provision-time=5m \
    ok-total-unready-count=0 \
    scale-down-delay-after-add=2m \
    scale-down-delay-after-delete=30s \
    scale-down-unneeded-time=1m \
    scale-down-unready-time=2m \
    scale-down-utilization-threshold=0.3 \
    skip-nodes-with-local-storage=true \
    skip-nodes-with-system-pods=true
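
You can read the applied profile back from the cluster to confirm the update:

az aks show \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --query 'autoScalerProfile' --output json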

Attaching a user-based GPU node pool (on-demand) [GPU]

As we deploy machine learning models, we might want a GPU node pool so that we can bring GPU instances into the cluster. You can check the types of GPU instances and the availability of GPU instances in your specific region (see the SKU check in step 3).

In the below command we assume Standard_NC6 for the node pool. Make sure to set the min, max, and initial node counts according to your specific needs. Pods that should land on this pool need a matching toleration, as shown after the command.

az aks nodepool add \
--cluster-name $CLUSTER_NAME \
--name gpupoolnc6 \
--resource-group $RESOURCE_GROUP \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--node-vm-size Standard_NC6 \
--node-taints nvidia.com/gpu=Present:NoSchedule \
--max-count 2 \
--min-count 1 \
--node-count 1 \
--node-osdisk-size 100 \
--mode User \
--tags team=datascience owner=truefoundry
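
Pods meant for this pool must tolerate the nvidia.com/gpu=Present:NoSchedule taint set above and request GPU resources. A minimal sketch, assuming the NVIDIA device plugin is running on the cluster (the pod name and image are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod                 # placeholder name
spec:
  tolerations:
    - key: nvidia.com/gpu            # matches the taint on the GPU node pool
      operator: Equal
      value: Present
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # requires the NVIDIA device plugin
EOF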

Read Understanding Azure Node Pools for more details on how to configure your node pools.