Azure Infra requirements

Clusters that are to be onboarded must meet certain requirements. These requirements cover networking, compute (CPU and GPU), and access.

Common requirements

  • The following must be installed. If you are using the Onboarding CLI, they are installed by default.

Network requirements

  • For an existing network, the following must be met
    • Enough IP addresses must be free. Check Creating cluster in an existing network
    • The subnet inside this existing network must not use the 10.244.0.0/16 or 10.255.0.0/16 CIDR blocks, as the cluster uses these by default.
  • For a new network, the following must be met
    • You will be asked for the CIDR of your network if you are deploying your cluster in a new network. A /8 CIDR block is recommended. Check creating your cluster in a new VPC
    • Subnet range should be in /16 or lower.
    • The new network must have a range that does not conflict with your other networks if you want to set up network peering in the future.
  • Security groups
    • Allow node-to-node connectivity
    • Allow egress traffic from nodes
    • Allow ingress traffic on ports 80 and 443
  • DNS setup, so that endpoints can be exposed
    • At least one endpoint is required for the workload cluster; it is used for hosting the ML workloads. It can be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com, *.tfy.<org-name>.com, *.ml-apps.<org-name>.com. A wildcard domain is not mandatory: endpoint-based routing is supported alongside subdomain-based routing.
    • When Istio is deployed, a load balancer IP address comes up. Map this IP address to the domain for your workloads in your DNS provider.
    • TLS/SSL termination can happen in the following ways
      • Using cert-manager - cert-manager can be installed in the cluster. It talks to your DNS provider to create DNS challenges and then creates secrets in the cluster that can be used by the Istio Gateway.
      • Certificate and key-pair file - A raw certificate file can also be used. For this, a secret containing the TLS certificate must be created. Refer here for more details on this. The secret can then be passed in the Istio Gateway configuration.
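
The CIDR rules above can be checked before any network is created. The following is a minimal sketch using Python's standard ipaddress module. The function name conflicting_ranges and its other_cidrs parameter are hypothetical names chosen for illustration, not part of the Onboarding CLI; only the two reserved ranges come from this page.

```python
import ipaddress

# Ranges this page says the cluster uses by default.
RESERVED_CIDRS = [
    ipaddress.ip_network("10.244.0.0/16"),
    ipaddress.ip_network("10.255.0.0/16"),
]


def conflicting_ranges(subnet_cidr, other_cidrs=()):
    """Return every reserved or peer range that overlaps subnet_cidr.

    other_cidrs (an assumption for this sketch) lists the ranges of
    networks you may want to peer with later.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    candidates = RESERVED_CIDRS + [ipaddress.ip_network(c) for c in other_cidrs]
    return [str(net) for net in candidates if subnet.overlaps(net)]


if __name__ == "__main__":
    for cidr in ("10.1.0.0/16", "10.244.16.0/20"):
        conflicts = conflicting_ranges(cidr)
        print(cidr, "->", conflicts if conflicts else "no conflict")
```

An empty result means the subnet avoids the default cluster ranges and any peer ranges you passed in. The sketch does not check how many IP addresses are free.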

Compute requirements

Compute requirements refer to the amount of compute (CPU/GPU/memory) that is available for use in your region. In Azure, this means setting up node pools according to your needs.

  • A minimum of 2 node pools must be created to ensure smooth functioning of the cluster. If you are using the Onboarding CLI, these are created by default.
    • Critical Node pool - This node pool should contain at least 2 nodes, each with a minimum of 2 vCPU and 4 GB RAM
      • This node pool can carry the taint CriticalAddonsOnly=true:NoSchedule to prevent other workloads from being deployed on it.
      • This node pool is used to deploy the agent and the critical components that power important parts of the platform and are necessary for its smooth functioning. These components are argocd, argo-rollouts, tfy-agent and istio.
      • Autoscaling is not required for this node pool but can be set to min 2 to max 3 nodes.
    • Spot Node pool - This node pool should contain at least 1 node with a minimum of 4 vCPU and 8 GB RAM.
      • This must be a spot node pool. It is used to deploy the remaining components of the platform.
      • These components are compute-heavy and can handle interruptions, so they are deployed on spot.
      • It is recommended to enable cluster autoscaler for this node pool. Check how to setup or update cluster autoscaler in your AKS cluster.
      • A GPU node pool must be attached if GPUs are required in the platform. The GPU node pool must have the below taint attached to it.
        key: nvidia.com/gpu
        value: Present
        effect: NoSchedule
        
    • To learn more about which node pools to attach, check the following document on understanding Azure node pool
  • Make sure you have enough compute quotas available for your workloads.
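
The node pool minimums above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the NodePool class, its role field and the check_pools function are hypothetical names invented for illustration and do not correspond to any Azure or platform API. The numbers and the GPU taint are taken from the list above.

```python
from dataclasses import dataclass, field

# Taint this page requires on GPU node pools (key=value:effect).
GPU_TAINT = "nvidia.com/gpu=Present:NoSchedule"

# role -> (minimum nodes, minimum vCPU per node, minimum RAM in GB per node)
MINIMUMS = {
    "critical": (2, 2, 4),
    "spot": (1, 4, 8),
}


@dataclass
class NodePool:
    """Hypothetical description of a node pool, for this sketch only."""
    role: str          # "critical", "spot" or "gpu"
    node_count: int
    vcpus: int
    memory_gb: int
    taints: list = field(default_factory=list)


def check_pools(pools):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    by_role = {pool.role: pool for pool in pools}
    for role, (nodes, vcpus, memory_gb) in MINIMUMS.items():
        pool = by_role.get(role)
        if pool is None:
            problems.append(f"missing {role} node pool")
            continue
        if pool.node_count < nodes:
            problems.append(f"{role} pool needs at least {nodes} node(s)")
        if pool.vcpus < vcpus or pool.memory_gb < memory_gb:
            problems.append(f"{role} pool nodes need {vcpus} vCPU and {memory_gb} GB RAM")
    for pool in pools:
        if pool.role == "gpu" and GPU_TAINT not in pool.taints:
            problems.append(f"gpu pool is missing taint {GPU_TAINT}")
    return problems


if __name__ == "__main__":
    pools = [
        NodePool("critical", node_count=2, vcpus=2, memory_gb=4,
                 taints=["CriticalAddonsOnly=true:NoSchedule"]),
        NodePool("spot", node_count=1, vcpus=4, memory_gb=8),
        NodePool("gpu", node_count=1, vcpus=8, memory_gb=56),
    ]
    for problem in check_pools(pools):
        print(problem)
```

The check only covers pool sizes and the GPU taint. Whether your subscription has enough compute quota must still be confirmed separately.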

Authentication

To create a Kubernetes cluster and all the required resources, you must meet the following criteria

  • If you are using the Onboarding CLI or the Azure CLI, you must have the following authentication set up for your user.
    • You must have a valid Azure subscription with billing enabled.
    • You must have a user with the following permissions
      • Contributor role on the above subscription
      • Role Based Access Control Administrator role on the above subscription
      • You can read the following document to understand how to grant admin privileges on your subscription