AWS infra requirements

Before a cluster can be onboarded, certain requirements must be fulfilled. These span networking, CPU, GPU and access. They are not specific to the type of deployment, be it terraform/terragrunt or eksctl/awscli.

📘 Control plane requirements

The requirements below apply to both workload and control plane clusters. However, the requirements that assume an existing EKS cluster do not apply to the control plane. Control plane clusters require setup by the TrueFoundry team.

Common requirements

  • If the cluster is to be set up at the client's end using terraform/terragrunt, the corresponding tooling must be installed on the machine that will run the deployment (a version-check sketch follows this list).
  • If a control plane installation is required, it must be set up together with the TrueFoundry team.
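The exact tool set is not spelled out on this page, so the sketch below is only a minimal check assuming a typical terraform/terragrunt workflow; aws, kubectl and helm are assumed additions and may not all be needed in your setup.

  # Minimal sketch: check that the expected CLIs are available on the machine
  # that will run the deployment. terraform/terragrunt come from this page;
  # aws, kubectl and helm are assumptions.
  for tool in terraform terragrunt aws kubectl helm; do
    command -v "$tool" >/dev/null 2>&1 && echo "found:   $tool" || echo "missing: $tool"
  done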

Network requirements

  • For an existing VPC, enough free IPs must be available in your private and public subnets.
    • A network load balancer is created in the public subnets, which requires 8 free IPs in each public subnet.
    • Private subnets will contain the EKS nodes. Each EKS node reserves a block of IP addresses for itself and for the pods deployed on it. A minimum of two private subnets, each with at least a /24 CIDR, must be available.
    • The private subnets must have the following tags (a sketch for verifying and applying these tags is included after this list)
      "kubernetes.io/cluster/${clusterName}": "shared"
      "subnet": "private"
      "kubernetes.io/role/internal-elb": "1"
      
    • Public Subnets must have the following tags available
      "kubernetes.io/cluster/${clusterName}": "shared"
      "subnet": "public"
      "kubernetes.io/role/elb": "1"
      
    • DNS support and DNS hostnames must be enabled in your VPC.
    • A NAT gateway must be attached to the VPC, and the route tables must allow outbound internet access from the private subnets through this NAT gateway.
    • For the public subnets, Auto-assign public IP addresses must be enabled if internet routing is required for the endpoints that will be hosted.
    • Your VPC must have the following tags
      "kubernetes.io/cluster/${clusterName}": "owned"
      
  • For a new VPC, the following requirements apply to the IP blocks
    • A network load balancer is created in the public subnets, which requires 8 free IPs in each public subnet.
    • Private subnets will contain the EKS nodes. Each EKS node reserves a block of IP addresses for itself and for the pods deployed on it. A minimum of two private subnets, each with at least a /24 CIDR, must be available.
    • The new VPC must use a CIDR range that does not conflict with your other VPCs, so that VPC peering remains possible in the future. For example, if your existing resources are deployed in a VPC (VPC1) with CIDR 10.10.0.0/16, the EKS cluster can be created in a VPC (VPC2) with CIDR 10.12.0.0/16, so the two VPCs can be peered if deployments in EKS need to connect to resources in VPC1.
  • The following DNS requirements must be met
    • TrueFoundry requires a minimum of two endpoints for a control plane cluster and a minimum of one endpoint for a workload cluster
    • Control plane
      • One endpoint for hosting the control plane UI. It can be something like truefoundry.<org-name>.ai or platform.<org-name>.com
      • One endpoint for hosting the ML workloads. It should be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com
    • Workload (only)
      • One endpoint for hosting the ML workloads. It can be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com. The endpoint is not required to be a wildcard; endpoint-based routing is supported along with subdomain-based routing.
    • If Istio is already deployed, make sure the hosts field in the Istio Gateway is set to the endpoint on which you want your workloads exposed (publicly). This endpoint must then be passed for the workload cluster from the UI.
    • If Istio is not deployed, a load balancer address will come up when Istio gets installed in the cluster during onboarding. The load balancer's hostname must be mapped as a CNAME record to the endpoint where your workloads will be hosted (publicly); see the DNS sketch after this list.
    • TLS/SSL termination can happen in three ways
      • Using ACM (recommended) - AWS Certificate Manager certificates can be used directly with the Istio Gateway. For this, the certificate ARN is passed in the configuration. You can refer to this example.
      • Using cert-manager - cert-manager can be installed in the cluster; it talks to your DNS provider to solve DNS challenges and creates certificate secrets in the cluster, which can then be used by the Istio Gateway.
      • Certificate and key-pair file - a raw certificate file can also be used. For this, a secret containing the TLS certificate must be created. Refer here for more details. The secret can then be passed in the Istio Gateway configuration (a sketch of this option follows this list).
    • Certificate and key pair can directly be passed in the Istio Gateway TLS settings.
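A minimal sketch for checking the network prerequisites above with the AWS CLI. The VPC ID, subnet IDs and cluster name are placeholders to substitute; the commands only read attributes or apply the tags listed above.

  # Sketch: verify DNS attributes and apply the required tags on an existing VPC.
  CLUSTER_NAME="my-cluster"            # placeholder
  VPC_ID="vpc-0123456789abcdef0"       # placeholder

  # DNS support and DNS hostnames must both be enabled on the VPC
  aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsSupport
  aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsHostnames

  # Tag the VPC as owned by the cluster
  aws ec2 create-tags --resources "$VPC_ID" \
    --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=owned"

  # Tag a private subnet (repeat for each private subnet; subnet ID is a placeholder)
  aws ec2 create-tags --resources subnet-0aaa0aaa0aaa0aaa0 \
    --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared" \
           "Key=subnet,Value=private" \
           "Key=kubernetes.io/role/internal-elb,Value=1"

  # Tag a public subnet and enable auto-assign public IPs (repeat for each public subnet)
  aws ec2 create-tags --resources subnet-0bbb0bbb0bbb0bbb0 \
    --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared" \
           "Key=subnet,Value=public" \
           "Key=kubernetes.io/role/elb,Value=1"
  aws ec2 modify-subnet-attribute --subnet-id subnet-0bbb0bbb0bbb0bbb0 --map-public-ip-on-launch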
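Once Istio is installed during onboarding, the ingress load balancer hostname can be fetched and mapped to the workload endpoint. The sketch below assumes the ingress gateway Service is named istio-ingressgateway in the istio-system namespace and that the DNS zone is hosted in Route 53; the service name, namespace, hosted zone ID and record name are assumptions/placeholders.

  # Sketch: point the workload endpoint at the Istio ingress load balancer via a CNAME.
  LB_HOSTNAME=$(kubectl get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789ABCDEFGHIJ \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "*.ml.example.com",
          "Type": "CNAME",
          "TTL": 300,
          "ResourceRecords": [{"Value": "'"$LB_HOSTNAME"'"}]
        }
      }]
    }'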
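For the certificate and key-pair option, a rough sketch of the secret and the Gateway reference is shown below; the gateway name, namespace, selector, host and file paths are placeholders, and the exact Gateway configuration used during onboarding may differ.

  # Sketch: create a TLS secret from a certificate/key pair (names and paths are placeholders).
  kubectl create secret tls workload-tls-cert -n istio-system \
    --cert=path/to/tls.crt --key=path/to/tls.key

  # The Istio Gateway can then reference the secret, roughly as in gateway.yaml below,
  # applied with: kubectl apply -f gateway.yaml
  apiVersion: networking.istio.io/v1beta1
  kind: Gateway
  metadata:
    name: workload-gateway
    namespace: istio-system
  spec:
    selector:
      istio: ingressgateway
    servers:
      - port:
          number: 443
          name: https
          protocol: HTTPS
        tls:
          mode: SIMPLE
          credentialName: workload-tls-cert   # the secret created above
        hosts:
          - "*.ml.example.com"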

Compute Requirements

  • Compute requirements refer to the amount of compute (CPU/GPU/memory) available for use in your region.
    • CPU/memory limits can be checked under the AWS EC2 service quotas (see the sketch after this list)
    • Check the following quota types for CPU
      • All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
    • By default, clusters created by TrueFoundry request only spot instances; if the cluster is created at the client's end, spot/on-demand quota requests should be made accordingly
    • If GPUs are required, the same page can be used to check the following quota types
      • All G and VT Spot/On-demand Instance Requests
      • All P Spot/On-demand Instance Requests
      • All Inf Spot Instance Requests
    • If the available quota is insufficient, a request should be raised with AWS to increase the limits
  • If using an already existing EKS cluster, the CSI driver must be pre-installed so that volumes can be provisioned by TrueFoundry.
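A sketch for checking the relevant EC2 quotas and, for an existing cluster, whether an EBS CSI driver addon is present. The quotas are matched by name rather than quota code, and the cluster name and region are placeholders; if the CSI driver was installed via Helm rather than as an EKS addon, check the kube-system workloads instead.

  # Sketch: list EC2 spot/on-demand instance quotas relevant to CPU and GPU capacity.
  aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Spot Instance Requests') || contains(QuotaName, 'On-Demand')].[QuotaName,Value]" \
    --output table

  # For an already existing EKS cluster, check whether the EBS CSI driver addon is installed
  # (cluster name and region are placeholders).
  aws eks describe-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver --region us-east-1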

Storing data

  • Once the EKS cluster is created, the following are required
    • Secrets store - a secret store is required for using the Secrets functionality on the cluster
    • Docker registry - a docker registry is required if builds need to take place. An already existing registry such as ECR, ACR or Artifact Registry can be used; if none exists, a new repository can be created in ECR on AWS (see the sketch after this list).
    • Blob storage - blob storage is required for storing the artifacts of your ML models. An already existing service such as AWS S3, Azure Blob Storage or Google Cloud Storage can be used; if none exists, a new bucket must be created on AWS.
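If no registry or bucket exists yet, both can be created with the AWS CLI; the names and region below are placeholders.

  # Sketch: create an ECR repository for builds and an S3 bucket for ML artifacts.
  aws ecr create-repository --repository-name tfy-workloads --region us-east-1

  aws s3api create-bucket --bucket my-org-tfy-artifacts --region us-east-1
  # Outside us-east-1 a location constraint is required, for example:
  # aws s3api create-bucket --bucket my-org-tfy-artifacts --region eu-west-1 \
  #   --create-bucket-configuration LocationConstraint=eu-west-1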

Access Requirements

  • For cluster creation, the following access is required
    • Admin access, with a temporary access key and secret key, can be given to the TrueFoundry team for the deployment.
    • If sharing credentials with admin access is not possible, the terragrunt/terraform modules can be cloned and executed at the client's end by someone with admin access to the cloud account. This way there is no need to share any credentials.
    • Credentials can be revoked once the cluster is created.
  • Post the cluster creation, the following must hold
    • STS must be enabled for the user that is being used to create the cluster. See Enabling STS in a region.
    • Billing must be enabled for the account in which the cluster is being deployed.
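A quick check, using a placeholder region, that the credentials being used are valid and that STS responds in the target region:

  # Sketch: confirm the active credentials and that STS works in the target region.
  aws sts get-caller-identity --region us-east-1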