Creating an AKS cluster using onboarding-cli

The Onboarding CLI is a command-line tool that deploys an Azure Kubernetes Service (AKS) cluster together with the infrastructure it requires. It asks for a few inputs and then automates the rest of the deployment, so cluster creation needs very little manual intervention.

Pre-requisites

  1. Install Azure CLI version 2.50 or later

  2. Install Git

  3. You must have an Azure subscription and a user that can create resources in it. This user should have

    1. The Contributor role on the subscription
    2. The RBAC admin role on the subscription
  4. Log in to Azure and set the subscription

    # login
    az login
    
    # setting the subscription
    az account set --subscription $SUBSCRIPTION_ID
    
  5. Read the Azure Infrastructure requirements carefully.
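
The tool prerequisites above can be checked from a script before starting. The following is a minimal sketch, not part of the onboarding CLI: the function names `version_ge` and `check_prereqs` are made up for illustration, and the version comparison assumes a `sort` that supports `-V` (GNU coreutils and recent macOS do).

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available on this machine.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: reports whether git and a recent enough Azure CLI are installed.
check_prereqs() {
  command -v git >/dev/null 2>&1 || { echo "git not found" >&2; return 1; }
  command -v az >/dev/null 2>&1 || { echo "az not found" >&2; return 1; }
  # `az version` prints JSON; the "azure-cli" key holds the CLI version.
  az_version=$(az version --query '"azure-cli"' --output tsv)
  version_ge "$az_version" 2.50 || {
    echo "azure cli $az_version is older than 2.50" >&2
    return 1
  }
  echo "prerequisites look fine (azure cli $az_version)"
}
```

Calling `check_prereqs` before `az login` surfaces a missing or outdated tool early. It does not check the role assignments of the user.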

Download the CLI

  1. Download the binary for your platform using the matching command below.
    1. For Apple Silicon macOS
      curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/binaries/ocli_darwin_arm64 -o ocli
      
    2. For Intel macOS
      curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/binaries/ocli_darwin_amd64 -o ocli
      
    3. For Linux (arm64)
      curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/binaries/ocli_linux_arm64 -o ocli
      
    4. For Linux (amd64)
      curl -H 'Cache-Control: max-age=0' -s https://releases.ocli.truefoundry.tech/binaries/ocli_linux_amd64 -o ocli
      
  2. Make the binary executable and move it to a directory on your $PATH
    sudo chmod +x ./ocli
    sudo mv ocli /usr/local/bin
    
  3. Confirm the installation by running the command
    ocli 
    

🚧

Update to latest version

Always make sure to update ocli to the latest version.
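
One way to make the download (and later updates) repeatable is to pick the binary name from the machine's platform. This is a sketch under assumptions: the helper names `ocli_binary_name` and `download_ocli` are hypothetical, and the mapping relies on the usual `uname -s` and `uname -m` values (`Darwin`/`Linux`, `arm64`/`aarch64`/`x86_64`). The URL and curl flags are the ones used in the download steps.

```shell
# ocli_binary_name OS ARCH: maps `uname -s` / `uname -m` values to a release binary name.
ocli_binary_name() {
  case "$1/$2" in
    Darwin/arm64)              echo ocli_darwin_arm64 ;;
    Darwin/x86_64)             echo ocli_darwin_amd64 ;;
    Linux/aarch64|Linux/arm64) echo ocli_linux_arm64 ;;
    Linux/x86_64)              echo ocli_linux_amd64 ;;
    *) echo "unsupported platform: $1/$2" >&2; return 1 ;;
  esac
}

# download_ocli: fetches the binary for the current machine into ./ocli.
download_ocli() {
  name=$(ocli_binary_name "$(uname -s)" "$(uname -m)") || return 1
  curl -H 'Cache-Control: max-age=0' -s \
    "https://releases.ocli.truefoundry.tech/binaries/$name" -o ocli
}
```

After `download_ocli`, the `chmod` and `mv` steps apply unchanged. Re-running it fetches whatever binary is currently published at that URL.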

Installation

Creating a config file

  1. This section walks through the options available for configuring an Azure AKS cluster.

  2. There are two ways to set up the network

    1. Existing Network - Use this when your Azure environment already has a network set up. The onboarding CLI can use the existing Virtual Network ID and Subnet ID.
    2. New Network - Select this if you don't have an existing network or want to deploy the TrueFoundry AKS cluster inside a new virtual network. You will be prompted for the network CIDR, which is the CIDR range of the network you want (default: 10.0.0.0/8), and for the subnet CIDR range (default: 10.10.0.0/16). You will also be asked for private and public CIDRs.
  3. Run the following command

    ocli infra init
    
  4. The screen will be cleared and you will be asked to choose a cloud provider. Select azure, then select the subscription where you want to deploy all your resources.

    Truefoundry is a platform that makes it very easy to deploy microservices, ML models training jobs, LLMs on Kubernetes. We will start the process of bootstrapping a Kubernetes cluster. This CLI is useful only if you don't have a Kubernetes cluster. If you already have a cluster, please go to https://docs.truefoundry.com/docs/creating-your-own-kubernetes-cluster
    Let's get started!
    
    1. Cloud Provider
    In which cloud provider you would like to deploy your cluster: :
       aws
    >  azure
       gcp
    
    2. Subscription details
    Which Azure Subscription you want to use ?:
    >  Microsoft Azure Sponsorship: xxxxx-xxxxx-xxxxx-xxxxxxxxx
       subscription-name: xxxxx-xxxxx-xxxxx-xxxxxxxxx
    

🚧

failed to acquire a token

The above error indicates that azure-cli is not installed on your local machine or that you are not authenticated with azure-cli. Make sure you have installed azure-cli and are authenticated using az login.

  1. Select the location from the drop down, then give your cluster a name. The actual cluster name will contain this name as a substring.

    3. Location
    In which location you want to deploy your cluster:
    >  Australia Central
       Australia Central 2
       
    4. Cluster name
    What should be the name of your cluster:
    
  2. Select whether you want to deploy the cluster in an existing resource group or a new resource group.

    5. Resource group
    Do you want to deploy the cluster in an existing resource group or a new resource group: :
    >  existing
       new
    

Existing Resource group

If you chose an existing resource group:

  1. Select the resource group where you want to deploy your cluster
    5(A). In which resource group you want to deploy your cluster:
    >  resourceGroup1
       resourceGroup2
    
  2. For an existing resource group you can select an existing network or create a new network
    6. Vnet details
    Do you want to deploy the cluster in an existing vnet or a new vnet: :
    >  existing
       new
    

Existing Network

  1. Select the network from the drop down
    6(A). Virtual network
    In which network you want to deploy your cluster: :
    >  vnet1
       vnet2
    
  2. Select the subnet from the drop down
    6(B). Subnet details
    In which network you want to deploy your cluster: :
    >  vnet1-default-subnet
       frontend-subnet
       backend-subnet
    

This will create a config file named config.yaml that contains the existing network and subnet IDs

aws: null
azure:
    cluster:
        name: clustername
    location: West Europe
    network:
        existing: true
        subnet_cidr: ""
        subnet_id: /subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/resourceGroup1/providers/Microsoft.Network/virtualNetworks/vnet1/subnets/vnet1-default-subnet
        vnet_cidr: ""
        vnet_id: /subscriptions/xxxxx-xxxxx-xxxxx-xxxxxxxxx/resourceGroups/resourceGroup1/providers/Microsoft.Network/virtualNetworks/vnet1
        vnet_name: vnet1
    resource_group:
        existing: true
        name: resourceGroup1
    state:
        container_name: ""
        resource_group: ""
        storage_account_name: ""
    subscription:
        id: xxxxx-xxxxx-xxxxx-xxxxxxxxx
        name: subscription-name
binaries:
    terraform:
        binary_path: null
    terragrunt:
        binary_path: null
gcp: null
provider: azure

New Network (recommended)

  1. For a new network, supply the vnet CIDR range (default: 10.0.0.0/8)
    6(A). Virtual network CIDR
    What is your expected vnet CIDR (default: 10.0.0.0/8):
    
  2. Supply the subnet CIDR range (default: 10.10.0.0/16)
    6(B). Subnet CIDR
    What is your expected subnet CIDR (default: 10.10.0.0/16):
    

This will create a YAML file named config.yaml containing the inputs

aws: null
azure:
    cluster:
        name: clustername
    location: West Europe
    network:
        existing: false
        subnet_cidr: 10.10.0.0/16
        subnet_id: ""
        vnet_cidr: 10.0.0.0/8
        vnet_id: ""
        vnet_name: ""
    resource_group:
        existing: true
        name: resourceGroup1
    state:
        container_name: ""
        resource_group: ""
        storage_account_name: ""
    subscription:
        id: xxxxx-xxxxx-xxxxx-xxxxxxxxx
        name: subscription-name
binaries:
    terraform:
        binary_path: null
    terragrunt:
        binary_path: null
gcp: null
provider: azure
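
In Azure, a subnet's address range has to fall inside the address space of its virtual network, so the two CIDR values entered for a new network must nest. The sketch below checks this for IPv4 ranges before the values are typed into the prompts. The helper names `ip_to_int` and `cidr_contains` are illustrative and not part of ocli.

```shell
# ip_to_int A.B.C.D: packs the four octets of an IPv4 address into one integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_contains OUTER INNER: succeeds when INNER lies entirely inside OUTER.
cidr_contains() {
  outer_ip=${1%/*}; outer_len=${1#*/}
  inner_ip=${2%/*}; inner_len=${2#*/}
  # The inner range cannot have a shorter prefix than the outer one.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  # Both addresses must agree on the outer network bits.
  [ $(( $(ip_to_int "$outer_ip") & mask )) -eq $(( $(ip_to_int "$inner_ip") & mask )) ]
}

cidr_contains 10.0.0.0/8 10.10.0.0/16 && echo "subnet fits inside the vnet"
```

The last line uses the default values from the prompts, where the subnet 10.10.0.0/16 sits inside the vnet 10.0.0.0/8.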

New Resource Group (recommended)

Choose this option if you want to deploy everything in a new, separate resource group.

  1. Enter the name of the resource group you want to create.
  2. A new resource group has no network yet, so a new network is created as well. Supply the vnet CIDR range (default: 10.0.0.0/8)
    6(A). Virtual network CIDR
    What is your expected vnet CIDR (default: 10.0.0.0/8):
    
  3. Supply the subnet CIDR range (default: 10.10.0.0/16)
    6(B). Subnet CIDR
    What is your expected subnet CIDR (default: 10.10.0.0/16):
    

This will create a YAML file named config.yaml with the given inputs

aws: null
azure:
    cluster:
        name: clustername
    location: West Europe
    network:
        existing: false
        subnet_cidr: 10.10.0.0/16
        subnet_id: ""
        vnet_cidr: 10.0.0.0/8
        vnet_id: ""
        vnet_name: ""
    resource_group:
        existing: false
        name: newrg
    state:
        container_name: ""
        resource_group: ""
        storage_account_name: ""
    subscription:
        id: xxxxx-xxxxx-xxxxx-xxxxxxxxx
        name: subscription-name
binaries:
    terraform:
        binary_path: null
    terragrunt:
        binary_path: null
gcp: null
provider: azure

Running the config file

Once the config file is created, create the infrastructure from it with the following command

ocli infra create --file config.yaml

❗️

Resource group could not be found.

This error can occur if you haven't set the subscription in azure-cli. Make sure to run

az account set --subscription $SUBSCRIPTION_ID
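
To catch this mismatch before running the create command, the subscription in config.yaml can be compared with the one azure-cli is currently using. This is a rough sketch: `config_subscription_id` and `check_subscription` are hypothetical helpers, and the awk scan assumes the config.yaml layout shown earlier (an `id:` line under `subscription:`) rather than parsing YAML properly.

```shell
# config_subscription_id FILE: prints the id listed under `subscription:` in FILE.
config_subscription_id() {
  awk '/^[ \t]*subscription:/ { in_sub = 1; next }
       in_sub && /^[ \t]*id:/  { print $2; exit }' "$1"
}

# check_subscription FILE: compares the config's subscription with the active az one.
check_subscription() {
  wanted=$(config_subscription_id "$1")
  active=$(az account show --query id --output tsv)
  if [ "$wanted" = "$active" ]; then
    echo "active subscription matches $1"
  else
    echo "active subscription is $active, config expects $wanted" >&2
    return 1
  fi
}
```

For example, `check_subscription config.yaml && ocli infra create --file config.yaml` only starts the deployment when the two IDs agree.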

❗️

The specified service CIDR 10.0.0.0/16 is conflicted with an existing subnet CIDR

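This can happen when you use an existing network whose subnet CIDR overlaps the service CIDR named in the error. Select a subnet whose range does not overlap it.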
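
Whether a given subnet would trigger this error can be estimated up front by testing its range against the service CIDR from the message. The following sketch does that for IPv4. `ip_to_int` and `cidr_overlaps` are illustrative helper names, and the service CIDR value is simply the one quoted in the error above.

```shell
# ip_to_int A.B.C.D: packs the four octets of an IPv4 address into one integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_overlaps A B: succeeds when the two ranges share at least one address.
cidr_overlaps() {
  a_ip=${1%/*}; a_len=${1#*/}
  b_ip=${2%/*}; b_len=${2#*/}
  # Two CIDR blocks overlap exactly when they agree on the shorter prefix.
  if [ "$a_len" -lt "$b_len" ]; then len=$a_len; else len=$b_len; fi
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$a_ip") & mask )) -eq $(( $(ip_to_int "$b_ip") & mask )) ]
}

SERVICE_CIDR=10.0.0.0/16   # value taken from the error message
cidr_overlaps "$SERVICE_CIDR" 10.10.0.0/16 || echo "10.10.0.0/16 does not overlap"
```

The address range of an existing subnet can usually be read with `az network vnet subnet show --ids <subnet_id> --query addressPrefix --output tsv`, though a subnet with several prefixes may report them under a different property.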

❗️

Error in creating Agent pool

This error can occur when the quota for spot or regional instances is too low. The error looks like this

╷
│ Error: creating Agent Pool (Subscription: "xxxxx-xxx-x-xxxxxxx"
│ Resource Group Name: "REDACTED"
│ Managed Cluster Name: "REDACTED"
│ Agent Pool Name: "spotpoold307"): agentpools.AgentPoolsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="PreconditionFailed" Message="Provisioning of resource(s) for Agent Pool spotpoold307 failed. Error: {\n  \"code\": \"InvalidTemplateDeployment\",\n  \"message\": \"The template deployment '5b1d1959-dca5-425a-a479-963bed76ae3b' is not valid according to the validation procedure. The tracking id is 'd33a7636-891f-4e51-b475-dd85f3e95156'. See inner errors for details.\",\n  \"details\": [\n   {\n    \"code\": \"QuotaExceeded\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved LowPriorityCores quota. Additional details - Deployment Model: Resource Manager, Location: LOCATION, Current Limit: 3, Current Usage: 0, Additional Required: 4, (Minimum) New Limit Required: 4. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22c03bdc39-be28-4bb7-8953-1339b663e8d0%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22westus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22lowPriorityCores%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:4,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22lowPriorityCores%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-portal/supportability/low-priority-quota\"\n   }\n  ]\n }"
│ 
│   with module.aks.azurerm_kubernetes_cluster_node_pool.node_pool["spot"],
│   on .terraform/modules/aks/main.tf line 523, in resource "azurerm_kubernetes_cluster_node_pool" "node_pool":
│  523: resource "azurerm_kubernetes_cluster_node_pool" "node_pool" {
│ 
╵
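
The quota figures in this message can be cross-checked from the command line before requesting an increase. The sketch below is an assumption-laden example: `required_limit` and `spot_core_quota` are made-up helper names, and the filter assumes that `az vm list-usage` reports spot cores under the name `lowPriorityCores`, as the quota link in the error suggests.

```shell
# required_limit USAGE ADDITIONAL: smallest quota that lets the deployment through.
required_limit() {
  echo $(( $1 + $2 ))
}

# spot_core_quota LOCATION: prints the limit and current usage of low-priority (spot) cores.
spot_core_quota() {
  az vm list-usage --location "$1" \
    --query "[?name.value=='lowPriorityCores'].[limit, currentValue]" --output tsv
}
```

With the numbers from the error above (current usage 0, additional required 4), `required_limit 0 4` gives 4, which matches the "New Limit Required" in the message and exceeds the current limit of 3.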

Post cluster configurations

Saving the output

The above process generates outputs that are needed to deploy some applications. Save them to a file

ocli infra output --file config.yaml > output.txt

Downloading the kubeconfig file

  1. Once the cluster is created, it needs to be attached to the TrueFoundry platform.
  2. Export the important variables
    export RESOURCE_GROUP=""
    export CLUSTER_NAME=""
    
  3. Run the following command to download the kubeconfig file to your local machine
    az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
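
Because the name given during setup is only a substring of the actual cluster name, the full name may need to be looked up before exporting `CLUSTER_NAME`. A possible sketch follows. `match_cluster_name` and `find_cluster` are hypothetical helpers, and the lookup assumes the cluster lives in `$RESOURCE_GROUP`.

```shell
# match_cluster_name SUBSTRING: keeps the names read from stdin that contain SUBSTRING.
match_cluster_name() {
  grep -F -- "$1"
}

# find_cluster RESOURCE_GROUP SUBSTRING: lists AKS clusters in the group whose name contains SUBSTRING.
find_cluster() {
  az aks list --resource-group "$1" --query "[].name" --output tsv | match_cluster_name "$2"
}
```

For example, `export CLUSTER_NAME=$(find_cluster "$RESOURCE_GROUP" clustername)` fills in the variable when exactly one cluster matches. After `az aks get-credentials`, running `kubectl get nodes` (assuming kubectl is installed) is a quick way to confirm that the kubeconfig works.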

Connecting the cluster to the platform

Follow the Connecting the cluster guide to connect the cluster to TrueFoundry's platform. Once this is done, a few applications need to be installed in the cluster, for which output.txt must be shared with the TrueFoundry team.