Creating an AKS cluster using azure-cli
Using the Azure CLI
1. Install Azure CLI
See the Azure CLI installation guide for your preferred workstation: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli. For example, on macOS:
brew update && brew install azure-cli
# confirm the CLI version
az version
{
"azure-cli": "2.50.0",
"azure-cli-core": "2.50.0",
"azure-cli-telemetry": "1.0.8",
"extensions": {}
}
2. Log In to Azure
You can log in through the Azure CLI in several ways:
# browser based login
az login
# with username and password
az login --username <user> --password <pass>
# login with the tenant
az login --tenant <tenantID>
# check Azure login help for other methods to log in
az login --help
3. Get the necessary details for the next steps
- Subscription details - Confirm that the currently active subscription is the one you want to use:
az account show
- Region - Choose the region where you want to deploy your Kubernetes cluster. If you plan to run GPU workloads, also check the types of GPU instances and their availability in your specific region. To get a list of all regions:
az account list-locations -o table
- In the above output, use the value in the second column (Name) as the region name.
- Resource group - A new resource group is recommended, although an existing one can also be used.
- To create a new resource group with tags, execute the command in step 5.
4. Export all the necessary details into environment variables
## subscription details
export SUBSCRIPTION_ID=""
export SUBSCRIPTION_NAME=""
# location
export LOCATION=""
# resource group
export RESOURCE_GROUP=""
# name of the user assigned identity (step 6)
export USER_ASSIGNED_IDENTITY=""
# name of the cluster
export CLUSTER_NAME=""
Set the subscription for all the steps below and add the aks-preview extension.
az account set --subscription $SUBSCRIPTION_ID
az extension add --name aks-preview
az extension update --name aks-preview
5. Creating Resource group
Most Azure resources are deployed into a resource group, so we will create one for our AKS cluster. We are naming it tfy-datascience, but feel free to follow your own naming conventions. We are also adding two tags: team=datascience and owner=truefoundry.
RESOURCE_GROUP_ID=$(az group create --name $RESOURCE_GROUP \
--location $LOCATION \
--tags team=datascience owner=truefoundry \
--query 'id' --output tsv)
6. Create user assigned identity
To authenticate to the AKS cluster after creation, we need a user-assigned identity. A managed identity is the way to authenticate to an Azure resource (AKS here) using Azure AD. There are two kinds of managed identities, system-assigned and user-assigned; we will use a user-assigned one. Copy the unique ID of the user-assigned identity from the step below:
USER_ASSIGNED_IDENTITY_ID=$(az identity create \
--resource-group $RESOURCE_GROUP \
--name $USER_ASSIGNED_IDENTITY \
--query 'id' --output tsv)
7. Creating AKS Cluster
There are two main ways to create an AKS cluster. You can choose either of the following.
A. Creating AKS cluster without specifying network requirements
Here we skip the network configuration during AKS creation, as it is handled automatically by Azure. We are using tfy-aks-cluster as the cluster name, and the node pool will autoscale between 1 and 2 nodes. Pass the user-assigned identity through the --assign-identity argument:
az aks create \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--enable-workload-identity \
--enable-managed-identity \
--assign-identity $USER_ASSIGNED_IDENTITY_ID \
--enable-oidc-issuer \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--kubernetes-version 1.26 \
--location $LOCATION \
--min-count 1 \
--max-count 2 \
--node-count 1 \
--network-plugin kubenet \
--node-vm-size Standard_D2s_v5 \
--node-osdisk-size 100 \
--nodepool-labels class.truefoundry.io=initial \
--nodepool-taints CriticalAddonsOnly=true:NoSchedule \
--enable-node-restriction \
--generate-ssh-keys \
--tags team=datascience owner=truefoundry
Get the kubeconfig file for the AKS cluster:
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
B. Creating AKS cluster with specific network requirements
- Export the virtual network name and the address prefixes:
export VNET_NAME=""
export VNET_ADDRESS_PREFIX=""
export SUBNET_ADDRESS_PREFIX=""
- As an example, the default prefixes:
export VNET_NAME="tfy-virtual-net"
export VNET_ADDRESS_PREFIX="192.168.0.0/16"
export SUBNET_ADDRESS_PREFIX="192.168.1.0/24"
- Create a virtual network tfy-virtual-net. All the nodes will be part of this virtual network. Make sure to copy the unique ID of the subnet in the created virtual network:
VNET_SUBNET_ID=$(az network vnet create \
--resource-group $RESOURCE_GROUP \
--name $VNET_NAME \
--address-prefix $VNET_ADDRESS_PREFIX \
--location $LOCATION \
--subnet-name $VNET_NAME-subnet \
--subnet-prefixes $SUBNET_ADDRESS_PREFIX \
--query 'newVNet.subnets[0].id' --output tsv)
- Create an AKS cluster tfy-aks-cluster-with-vnet inside the above network. We are using the user-assigned identity created earlier along with the unique subnet ID of the virtual network, and the node pool will again autoscale between 1 and 2 nodes:
az aks create \
--name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP \
--enable-workload-identity \
--enable-managed-identity \
--assign-identity $USER_ASSIGNED_IDENTITY_ID \
--network-plugin kubenet \
--enable-oidc-issuer \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--kubernetes-version 1.26 \
--location $LOCATION \
--min-count 1 \
--max-count 2 \
--node-count 1 \
--node-vm-size Standard_D2s_v5 \
--node-osdisk-size 100 \
--nodepool-labels class.truefoundry.io=initial \
--nodepool-taints CriticalAddonsOnly=true:NoSchedule \
--enable-node-restriction \
--vnet-subnet-id $VNET_SUBNET_ID \
--service-cidr 10.0.0.0/16 \
--dns-service-ip 10.0.0.10 \
--pod-cidr 10.244.0.0/16 \
--docker-bridge-address 172.17.0.1/16 \
--generate-ssh-keys \
--tags team=datascience owner=truefoundry
- Get the kubeconfig file for the AKS cluster:
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
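The CIDR ranges passed above have to be chosen carefully: the VNet, service CIDR, and pod CIDR must not overlap, the subnet prefix must sit inside the VNet, and the DNS service IP must fall inside the service CIDR. A quick sanity check of the example values above, sketched with Python's ipaddress module:

```python
import ipaddress

# Ranges from the example az aks create / az network vnet create commands above
vnet = ipaddress.ip_network("192.168.0.0/16")        # --address-prefix
subnet = ipaddress.ip_network("192.168.1.0/24")      # --subnet-prefixes
service_cidr = ipaddress.ip_network("10.0.0.0/16")   # --service-cidr
pod_cidr = ipaddress.ip_network("10.244.0.0/16")     # --pod-cidr
dns_ip = ipaddress.ip_address("10.0.0.10")           # --dns-service-ip

# The VNet, service CIDR, and pod CIDR must be disjoint
ranges = [vnet, service_cidr, pod_cidr]
for i, a in enumerate(ranges):
    for b in ranges[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"

# The node subnet must sit inside the VNet
assert subnet.subnet_of(vnet)

# The DNS service IP must live inside the service CIDR
assert dns_ip in service_cidr

print("CIDR layout OK")
```

If you substitute your own prefixes in step B, re-running this check first can save a failed cluster create.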
az aks create command error: unrecognized arguments: --enable-node-restriction, --enable-workload-identity
If you face this error during cluster creation, it is most likely because the aks-preview extension is missing or out of date. Install or update it as shown in step 4 (az extension add --name aks-preview).
Attaching a user based spot node pool
It is advisable to attach a user-mode spot node pool to the AKS cluster for workloads that can tolerate interruptions. There are two kinds of node pools in Azure: system and user. System node pools host AKS-related applications and workloads, while user node pools run only your own workloads. Each of these node pools can contain instances of either the on-demand or the spot type.
The first node pool we created was of type on-demand; now we will create one of type spot.
az aks nodepool add \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
--name spotnodepool \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--node-vm-size Standard_D4s_v5 \
--min-count 1 \
--node-count 1 \
--max-count 10 \
--mode User \
--node-osdisk-size 100 \
--tags team=datascience owner=truefoundry \
--no-wait
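AKS automatically taints spot node pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so only pods that explicitly tolerate that taint are scheduled onto spot nodes. A minimal pod spec sketch (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: interruptible-job      # placeholder name
spec:
  containers:
    - name: worker
      image: busybox           # placeholder image
      command: ["sleep", "3600"]
  tolerations:
    # Matches the taint AKS places on spot node pools
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
```

Workloads without this toleration will keep landing on the on-demand pools.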
Update autoscaling configurations
Run the below command to update the default cluster-autoscaler configurations
az aks update \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--cluster-autoscaler-profile expander=random scan-interval=30s max-graceful-termination-sec=180 max-node-provision-time=5m ok-total-unready-count=0 scale-down-delay-after-add=2m scale-down-delay-after-delete=30s scale-down-unneeded-time=1m scale-down-unready-time=2m scale-down-utilization-threshold=0.3 skip-nodes-with-local-storage=true skip-nodes-with-system-pods=true
Attaching a user based GPU node pool (on-demand)
Since we deploy machine learning models, we may want a GPU node pool to bring up GPU instances. You can check the types of GPU instances and their availability in your specific region.
In the command below we assume the Standard_NC6 VM size for the node pool. Make sure to set the min, max, and initial node counts according to your specific needs.
az aks nodepool add \
--cluster-name $CLUSTER_NAME \
--name gpupoolnc6 \
--resource-group $RESOURCE_GROUP \
--enable-cluster-autoscaler \
--enable-encryption-at-host \
--node-vm-size Standard_NC6 \
--node-taints nvidia.com/gpu=Present:NoSchedule \
--max-count 2 \
--min-count 1 \
--node-count 1 \
--node-osdisk-size 100 \
--mode User \
--tags team=datascience owner=truefoundry
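The GPU pool above is tainted with nvidia.com/gpu=Present:NoSchedule, so workloads must tolerate that taint to be scheduled there, and typically also request an nvidia.com/gpu resource (which assumes the NVIDIA device plugin is running in the cluster). A sketch, with placeholder pod name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                # placeholder name
spec:
  containers:
    - name: trainer
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # request one GPU on the node
  tolerations:
    # Matches the taint set via --node-taints above
    - key: nvidia.com/gpu
      operator: Equal
      value: Present
      effect: NoSchedule
```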
Read Understanding Azure Node Pools for more details on how to configure your node pools.