AWS infra requirements
AWS cloud requirements
Clusters that are to be onboarded must meet certain requirements. These requirements cover networking, compute (CPU and GPU) and access.
Common requirements
- The following tools must be installed. If you are using the Onboarding CLI, they are installed by default.
- For control plane deployment, you need to install additional tools.
Network requirements
Existing VPC
- The requirements below apply if you are bringing an existing VPC for your Kubernetes cluster. They also apply if you are bringing an existing cluster.
- You will be asked for your VPC ID and subnet IDs if you want to deploy the EKS cluster in an existing network. Check Creating cluster in an existing VPC
- Network load balancer - TrueFoundry relies on a network load balancer being created for the EKS cluster.
- If you already have a cluster, make sure to install the Istio application so that a gateway can be created, which in turn creates a network load balancer.
- TrueFoundry relies on the Istio service mesh to create virtual services for the hosted endpoints.
- If you are using the Onboarding CLI, the network load balancer is created for you.
- The subnet can be modified to be internal or external.
- Private subnets will contain the EKS nodes. Each EKS node requires an IP address block that is reserved for the node and the pods deployed on it. A minimum of two private subnets with at least a /24 CIDR must be available.
- The private subnets must have the following tags. This is not required if you are using the Onboarding CLI.
  "kubernetes.io/cluster/${clusterName}": "shared"
  "subnet": "private"
  "kubernetes.io/role/internal-elb": "1"
- The public subnets must have the following tags. This is not required if you are using the Onboarding CLI.
  "kubernetes.io/cluster/${clusterName}": "shared"
  "subnet": "public"
  "kubernetes.io/role/elb": "1"
- DNS support and DNS hostnames must be enabled in your VPC.
- A NAT Gateway must be connected in the VPC and the route tables should allow outbound internet access for private subnets through this NAT gateway.
- For the public subnets, Auto-assign IP address must be enabled if internet routing is required for the endpoints that will be hosted.
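The tag requirements above can be checked before onboarding. The following is a minimal sketch in Python, not part of the TrueFoundry tooling: the function names `required_subnet_tags` and `missing_subnet_tags` and the cluster name `demo` are hypothetical, and the tags of each subnet are assumed to have been fetched already as a plain dictionary.

```python
def required_subnet_tags(cluster_name: str, kind: str) -> dict[str, str]:
    """Return the tags listed above for a 'private' or 'public' subnet."""
    if kind not in ("private", "public"):
        raise ValueError(f"kind must be 'private' or 'public', got {kind!r}")
    role_key = (
        "kubernetes.io/role/internal-elb"
        if kind == "private"
        else "kubernetes.io/role/elb"
    )
    return {
        f"kubernetes.io/cluster/{cluster_name}": "shared",
        "subnet": kind,
        role_key: "1",
    }


def missing_subnet_tags(
    actual: dict[str, str], cluster_name: str, kind: str
) -> dict[str, str]:
    """Return the required tags that are absent or carry a different value."""
    required = required_subnet_tags(cluster_name, kind)
    return {key: value for key, value in required.items() if actual.get(key) != value}


# Hypothetical private subnet that lacks the internal-elb role tag.
tags = {"kubernetes.io/cluster/demo": "shared", "subnet": "private"}
print(missing_subnet_tags(tags, "demo", "private"))
# prints {'kubernetes.io/role/internal-elb': '1'}
```

An empty result means the subnet carries every tag from the list for its kind.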
New VPC
- For a new VPC, the requirements for IP blocks are listed below.
- You will be asked for the CIDR of your VPC if you are deploying your cluster in a new VPC. A /16 CIDR block is recommended. Check creating your cluster in a new VPC
- Private subnets will contain the EKS nodes, and the subnet CIDRs should be at least /24 (/20 is recommended).
- The new VPC must use a CIDR range that does not conflict with your other VPCs, so that VPC peering remains possible in the future. For example, if your existing resources are deployed in a VPC (VPC1) with CIDR 10.10.0.0/16, the EKS cluster can be created in a VPC (VPC2) with CIDR 10.12.0.0/16, so that the two VPCs can be peered if deployments in EKS need to connect to resources in VPC1.
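The CIDR rules above can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the usable-address figure subtracts the five addresses that AWS reserves in every subnet.

```python
import ipaddress
from itertools import islice

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses left for nodes and pods in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def conflicting_cidrs(new_vpc_cidr: str, existing_cidrs: list[str]) -> list[str]:
    """Existing VPC CIDRs that overlap the new one and would block peering."""
    new_net = ipaddress.ip_network(new_vpc_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if new_net.overlaps(ipaddress.ip_network(cidr))
    ]


def carve_subnets(vpc_cidr: str, new_prefix: int = 20, count: int = 2) -> list[str]:
    """First `count` subnets of size /new_prefix inside the VPC CIDR."""
    subnets = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return [str(subnet) for subnet in islice(subnets, count)]


print(usable_ips("10.12.0.0/24"))                          # prints 251
print(usable_ips("10.12.0.0/20"))                          # prints 4091
print(conflicting_cidrs("10.12.0.0/16", ["10.10.0.0/16"]))  # prints []
print(carve_subnets("10.12.0.0/16"))
# prints ['10.12.0.0/20', '10.12.16.0/20']
```

The gap between 251 and 4091 usable addresses illustrates why /20 is the recommended subnet size when every pod consumes an address.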
DNS
- The following requirements must be met for DNS
- At least one endpoint is required for the workload cluster; it is used for hosting the ML workloads. It can be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com, *.tfy.<org-name>.com, *.ml-apps.<org-name>.com. The domain does not have to be a wildcard; endpoint-based routing is supported along with subdomain-based routing. For control plane deployment, one more non-wildcard endpoint is required to host the control plane UI.
- When Istio is deployed, a load balancer address comes up, which needs to be mapped to the domain for your workloads.
- TLS/SSL termination can happen in three ways
- Using ACM (recommended) - AWS Certificate Manager certificates can be used directly by the Istio Gateway. The certificate ARN is passed for this. You can refer to this example
- Using cert-manager - cert-manager can be installed in the cluster; it then talks to your DNS provider to complete DNS challenges and creates secrets in the cluster, which can then be used by the Istio Gateway.
- Certificate and key-pair file - A raw certificate file can also be used. For this, a secret containing the TLS certificate must be created. Refer here for more details on this. The secret can then be passed in the Istio Gateway configuration.
- Certificate and key pair can directly be passed in the Istio Gateway TLS settings.
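Whichever TLS option is chosen, the certificate has to cover the hostnames of the hosted endpoints. The sketch below checks a hostname against a wildcard domain, assuming the rule applied to wildcard TLS certificates, where a leading `*` stands for exactly one DNS label. The function name `covered_by` and the `example.com` hostnames are hypothetical.

```python
def covered_by(hostname: str, pattern: str) -> bool:
    """True if `hostname` matches `pattern`.

    A leading '*' label in the pattern matches exactly one non-empty label,
    which is the assumption made here for wildcard TLS certificates.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        return bool(host_labels[0]) and host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


print(covered_by("app.ml.example.com", "*.ml.example.com"))  # prints True
print(covered_by("a.b.ml.example.com", "*.ml.example.com"))  # prints False
print(covered_by("ml.example.com", "*.ml.example.com"))      # prints False
```

Under this assumption a wildcard at the subdomain level does not cover names one level deeper, so a sub-subdomain wildcard needs its own certificate entry.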
Compute Requirements
- Compute requirements refer to the amount of compute (CPU/GPU/memory) that is available for use in your region.
- CPU/memory limits can be checked in the AWS EC2 service quotas
- Check for the following types for CPU
- All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
- If GPUs are required, the same page can be used to check the following types
- All G and VT Spot/On-demand Instance Requests
- All P Spot/On-demand Instance Requests
- All Inf Spot Instance Requests
- If the available compute is insufficient, raise a request with AWS to increase the limits before onboarding the cluster.
- TrueFoundry relies on Karpenter for workload provisioning.
- If you are using Onboarding CLI then Karpenter will be installed automatically.
- If you bring your own cluster, then you can install Karpenter manually.
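The EC2 instance-request quotas mentioned above are expressed in vCPUs, so a planned node mix can be compared against a quota with simple arithmetic. The following is a minimal sketch: the helper names, the node plan and the quota value of 32 are hypothetical, and the quota is assumed to have been read from the Service Quotas page by hand.

```python
def vcpus_needed(plan: dict[str, tuple[int, int]]) -> int:
    """Total vCPUs for a plan mapping instance type -> (vCPUs each, count)."""
    return sum(vcpus * count for vcpus, count in plan.values())


def quota_shortfall(plan: dict[str, tuple[int, int]], quota_vcpus: int) -> int:
    """vCPUs by which the plan exceeds the quota; 0 if the plan fits."""
    return max(0, vcpus_needed(plan) - quota_vcpus)


# Hypothetical plan: six m5.xlarge (4 vCPUs each) and two c5.2xlarge (8 vCPUs each).
plan = {"m5.xlarge": (4, 6), "c5.2xlarge": (8, 2)}
print(vcpus_needed(plan))         # prints 40
print(quota_shortfall(plan, 32))  # prints 8
```

A non-zero shortfall is the amount to include in the limit-increase request before onboarding.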
Volumes
TrueFoundry relies on two types of volumes in the EKS cluster
- AWS Elastic Block Store - If you are using the Onboarding CLI, all the components for AWS EBS are installed. A default storage class of gp3 volumes is created with encryption and auto-expansion enabled. For manual installation follow EBS installation.
- AWS Elastic File System - If you are using the Onboarding CLI, all the components for AWS EFS are installed. A default storage class for EFS volumes is created. For manual installation follow EFS installation.
Storing data
To store data and container images, an IAM role is required with access to an S3 bucket, ECR and SSM (Parameter Store for secrets management).
- If you are using the Onboarding CLI, the IAM role and an S3 bucket are already created. Check the IAM role section to understand how to edit this.
- If you are bringing your own cluster, you can follow the steps to create an IAM role which has access to ECR, S3 and SSM.
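For orientation, the shape of a policy granting access to S3, ECR and SSM can be sketched as a Python dictionary. This is an illustration only and not the exact policy TrueFoundry requires: the action lists are an assumption chosen for the example, and the function name `storage_policy`, the bucket name, account ID and parameter prefix are hypothetical.

```python
import json


def storage_policy(bucket: str, region: str, account_id: str, parameter_prefix: str) -> dict:
    """Build an illustrative IAM policy document for S3, ECR and SSM access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3Bucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                # The bucket ARN covers listing; the /* ARN covers the objects.
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Sid": "EcrImages",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:BatchGetImage",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:InitiateLayerUpload",
                    "ecr:UploadLayerPart",
                    "ecr:CompleteLayerUpload",
                    "ecr:PutImage",
                ],
                # Kept broad for the sketch; narrow to repository ARNs where possible.
                "Resource": "*",
            },
            {
                "Sid": "SsmParameters",
                "Effect": "Allow",
                "Action": ["ssm:GetParameter", "ssm:PutParameter", "ssm:DeleteParameter"],
                "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/{parameter_prefix}/*",
            },
        ],
    }


# Hypothetical values for illustration.
policy = storage_policy("my-tfy-bucket", "us-east-1", "123456789012", "tfy")
print(json.dumps(policy, indent=2))
```

Scoping the S3 and SSM statements to a single bucket and parameter prefix keeps the role limited to the resources the cluster actually uses.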
Access Requirements
- STS must be enabled for the user that is used to create the cluster. See Enabling STS in a region
- Billing must be enabled for the account in which the cluster is deployed.
- For cluster creation, the following access is required
- Admin access with a temporary access key and secret key can be given to the TrueFoundry team. The credentials can be revoked once the cluster is created.
- If sharing a temporary access key and secret key is not possible, you can use the Onboarding CLI to create your cluster. If giving admin permissions to the user running the Onboarding CLI is not possible, we can give you an exact set of permissions to use.
- After cluster creation, the following steps are to be followed if you are not using the Onboarding CLI to create your cluster