AWS
This page provides an architecture overview, requirements, and steps to set up a TrueFoundry compute plane cluster in AWS.
The architecture of a TrueFoundry compute plane is as follows:
Requirements:
The requirements to set up the compute plane in each of the scenarios are as follows:
- Billing must be enabled for the AWS account.
- Please make sure you have STS enabled and have enough permissions to create the resources needed.
- Please make sure you have enough quotas for GPU/Inferentia instances on the account depending on your use case. You can check and increase quotas at AWS EC2 service quotas.
- Please make sure you have created a certificate for your domain in AWS Certificate Manager (ACM) and have the ARN of the certificate ready. This is required to set up TLS for the load balancer. Check this for more details.
Regarding the VPC and EKS cluster, you can decide between the following scenarios:
- The new VPC will have a CIDR range of /20 or larger, at least 2 availability zones, and private subnets with CIDR /24 or larger. This is to ensure capacity for ~250 instances and 4096 pods.
- If you are using custom networking, you need to have CGNAT IP address space in each AZ. CGNAT space and route tables will be set up in the VPC.
- A NAT gateway will be provisioned to provide internet access to the private subnets.
- We should have egress access to public.ecr.aws, quay.io, ghcr.io, tfy.jfrog.io, docker.io/natsio, nvcr.io, and registry.k8s.io so that we can download the Docker images for argocd, nats, gpu operator, argo rollouts, argo workflows, istio, keda, etc.
- We need a domain to map to the service endpoints. A wildcard domain like *.services.example.com is preferred. TrueFoundry can do path-based routing like services.example.com/tfy/*; however, many frontend applications do not support this.
- We will need a certificate ARN (for the domain provided above) to attach to the load balancer so as to terminate TLS traffic at the load balancer.
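The egress requirement above can be sanity-checked from a machine inside the target network before provisioning. The following is a minimal sketch, not part of the TrueFoundry tooling: the helper names (`registry_host`, `tcp_connect`, `check_egress`) are hypothetical, and a successful TCP connection on port 443 is only a rough signal, since an image pull may involve additional hosts and redirects.

```python
import socket
from typing import Callable, Dict, Iterable

# Registries named in the egress requirement.
REGISTRIES = [
    "public.ecr.aws",
    "quay.io",
    "ghcr.io",
    "tfy.jfrog.io",
    "docker.io/natsio",
    "nvcr.io",
    "registry.k8s.io",
]


def registry_host(entry: str) -> str:
    """Return the hostname part of an entry such as 'docker.io/natsio'."""
    return entry.split("/", 1)[0]


def tcp_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, timeouts, and refused connections
        return False


def check_egress(
    entries: Iterable[str],
    connect: Callable[[str], bool] = tcp_connect,
) -> Dict[str, bool]:
    """Map each registry hostname to whether the connect probe succeeded."""
    return {registry_host(e): connect(registry_host(e)) for e in entries}


# Example (performs real network calls):
#   for host, ok in check_egress(REGISTRIES).items():
#       print(host, "reachable" if ok else "BLOCKED")
```

The `connect` parameter is injectable so the logic can be exercised without network access; in a real check the default TCP probe is used.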
- The existing VPC should have at least 2 private subnets in different AZs with CIDR /24. This ensures capacity for ~250 instances and 4096 pods.
- The VPC should have a NAT gateway for the private subnets.
- There should be at least one public subnet in the existing VPC with CIDR /28 for the public load balancer.
- The VPC should have Auto-assign IP address enabled. It should also have DNS support and DNS hostnames enabled.
- If you are using custom networking, you need to have CGNAT space subnets.
- We need a domain to map to the service endpoints. A wildcard domain like *.services.example.com is preferred. TrueFoundry can do path-based routing like services.example.com/tfy/*; however, many frontend applications do not support this.
- We will need a certificate ARN (for the domain provided above) to attach to the load balancer so as to terminate TLS traffic at the load balancer.
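The subnet sizing rules for an existing VPC can be checked with Python's standard `ipaddress` module. This is an illustrative sketch under two assumptions: the function names are hypothetical, and the /24 and /28 figures are read as "this size or larger". It also shows where the ~250 figure comes from: a /24 holds 256 addresses, and AWS reserves 5 addresses in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses AWS leaves assignable in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def at_least_as_large(cidr: str, max_prefix: int) -> bool:
    """True if the network is /max_prefix or larger (smaller prefix = larger network)."""
    return ipaddress.ip_network(cidr).prefixlen <= max_prefix


def check_existing_vpc(private_subnets: dict, public_subnets: dict) -> list:
    """Both arguments map subnet CIDR -> availability zone. Returns a list of problems."""
    problems = []
    big_enough = {c: az for c, az in private_subnets.items() if at_least_as_large(c, 24)}
    if len(set(big_enough.values())) < 2:
        problems.append("need private subnets of /24 or larger in at least 2 AZs")
    if not any(at_least_as_large(c, 28) for c in public_subnets):
        problems.append("need at least one public subnet of /28 or larger")
    return problems


print(usable_ips("10.0.1.0/24"))  # 256 addresses minus 5 reserved
```

An empty list from `check_existing_vpc` means the sizing and AZ-spread rules are met; it does not check NAT gateways, DNS settings, or CGNAT subnets.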
Your subnets and the EKS node security group must have the following tags for the TrueFoundry Terraform code to work with them.
Resource Type | Required Tags | Description |
---|---|---|
Private Subnets | kubernetes.io/cluster/${clusterName}: "shared"; subnet: "private"; kubernetes.io/role/internal-elb: "1" | Tags required for EKS to properly manage internal load balancers and subnet identification |
Public Subnets | kubernetes.io/cluster/${clusterName}: "shared"; subnet: "public"; kubernetes.io/role/elb: "1" | Tags required for EKS to properly manage external load balancers and subnet identification |
EKS Node Security Group | karpenter.sh/discovery: "${clusterName}" | This tag is required for Karpenter to discover and manage node provisioning for the cluster |
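The subnet rows of the table above can be expressed as a small lookup that reports which tags are missing or wrong. This is a sketch only; `required_subnet_tags` and `missing_tags` are hypothetical helper names, and the tags passed in would come from wherever you read your subnet metadata.

```python
def required_subnet_tags(cluster_name: str, kind: str) -> dict:
    """Return the tags from the table for a 'private' or 'public' subnet."""
    if kind not in ("private", "public"):
        raise ValueError("kind must be 'private' or 'public'")
    role_key = (
        "kubernetes.io/role/internal-elb"
        if kind == "private"
        else "kubernetes.io/role/elb"
    )
    return {
        f"kubernetes.io/cluster/{cluster_name}": "shared",
        "subnet": kind,
        role_key: "1",
    }


def missing_tags(actual: dict, required: dict) -> dict:
    """Return the required tags that are absent or have a different value."""
    return {k: v for k, v in required.items() if actual.get(k) != v}


# Example with a hypothetical cluster name and a partially tagged subnet.
required = required_subnet_tags("demo-cluster", "private")
print(missing_tags({"subnet": "private"}, required))
```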
- The EKS version should be 1.30 or higher.
- The EBS CSI Driver should be installed (see the Installation Guide). It is required for persistent volume support for Notebooks and SSH.
- The EFS CSI Driver should be installed (see the Installation Guide). It is required for model and data caching.
- The AWS Load Balancer Controller (>=v2.12.0) should be installed (see the Installation Guide). It is required for Ingress and Service type LoadBalancer support. The appropriate IAM roles for service accounts (IRSA) should be created.
- Although it is not compulsory, we highly recommend installing Karpenter on the cluster. It makes many TrueFoundry features easier, faster, and more cost-effective.
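The version minimums above (EKS 1.30, AWS Load Balancer Controller v2.12.0) need a numeric, component-wise comparison; comparing them as strings or floats would rank 1.9 above 1.30. A minimal sketch follows, assuming plain dotted numeric versions with an optional leading "v"; the function names are hypothetical, and suffixed versions would need extra parsing.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v2.12.0' into (2, 12, 0) and '1.30' into (1, 30)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(actual: str, minimum: str) -> bool:
    """Compare versions component by component, padding the shorter one with zeros."""
    a, m = parse_version(actual), parse_version(minimum)
    width = max(len(a), len(m))
    a += (0,) * (width - len(a))
    m += (0,) * (width - len(m))
    return a >= m


print(meets_minimum("1.31", "1.30"))        # EKS version check
print(meets_minimum("v2.11.0", "v2.12.0"))  # controller version check
```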
Setting up compute plane
TrueFoundry compute plane infrastructure is provisioned using Terraform. You can get the Terraform code for your exact account by filling in your account details and downloading a script that can be executed on your local machine.
Choose to create a new cluster or attach an existing cluster
Go to the platform section in the left panel and click on Clusters. You can click on Create New Cluster or Attach Existing Cluster depending on your use case. Read the requirements and, if everything is satisfied, click on Continue.
Fill up the form to generate the terraform code
A form will be presented with the details for the new cluster to be created. Fill it in with your cluster details. Click Submit when done.
The key fields to fill in here are:
- Cluster Name - A name for your cluster.
- Region - The region where you want to create the cluster.
- Network Configuration - Choose between New VPC or Existing VPC depending on your use case.
- Authentication - This is how you are authenticated to AWS on your local machine. It’s used to configure Terraform to authenticate with AWS.
- S3 Bucket for Terraform State - Terraform state will be stored in this bucket. It can be a preexisting bucket or a new bucket name. The new bucket will automatically be created by our script.
- Platform Features - This is to decide which features like BlobStorage, ClusterIntegration, ParameterStore, DockerRegistry and SecretsManager will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
The key fields to fill in when attaching an existing cluster are:
- Region - The region where your cluster is already created.
- Cluster Configuration - Provide the details of the existing cluster like the name of the cluster, URL of the OIDC provider, and the other required ARNs on the form.
- Cluster Addons - TrueFoundry needs to install addons like ArgoCD, ArgoWorkflows, Keda, Istio, etc. Please disable the addons that are already installed on your cluster so that the TrueFoundry installation does not override the existing configuration and affect your existing workloads.
- Network Configuration - Provide the details of the existing VPC and subnets where the cluster is already created.
- Authentication - This is how you are authenticated to AWS on your local machine. It’s used to configure Terraform to authenticate with AWS.
- S3 Bucket for Terraform State - Terraform state will be stored in this bucket. It can be a preexisting bucket or a new bucket name. The new bucket will automatically be created by our script.
- Platform Features - This is to decide which features like BlobStorage, ClusterIntegration, ParameterStore, DockerRegistry and SecretsManager will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
Copy the curl command and execute it on your local machine
You will be presented with a curl command to download and execute the script. The script will take care of installing the prerequisites, downloading the Terraform code, and running it on your local machine to create the cluster. This will take around 40-50 minutes to complete.
Verify the cluster is showing as connected in the platform
Once the script is executed, the cluster will be shown as connected in the platform.
Create DNS Record
We can get the load balancer’s IP address by going to the platform section in the bottom left panel under the Clusters section. Under the preferred cluster, you’ll see the load balancer IP address under the Base Domain URL section.
Create a DNS record in Route 53 or your DNS provider with the following details:
Record Type | Record Name | Record value |
---|---|---|
CNAME | *.tfy.example.com | LOADBALANCER_IP_ADDRESS |
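If the domain is hosted in Route 53, the record in the table can be written as a change batch, which is the JSON shape accepted by the Route 53 ChangeResourceRecordSets API. The sketch below only builds and prints that structure; the function name, the TTL of 300 seconds, and the `UPSERT` action are assumptions chosen for illustration, and the record value is the same placeholder used in the table.

```python
import json


def wildcard_cname_change(record_name: str, target: str, ttl: int = 300) -> dict:
    """Build a Route 53 change batch that creates or updates one CNAME record."""
    return {
        "Comment": "Wildcard record for TrueFoundry workloads",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if it exists
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


# Placeholder value from the table; replace it with the address shown in the platform.
batch = wildcard_cname_change("*.tfy.example.com", "LOADBALANCER_IP_ADDRESS")
print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file and applied with `aws route53 change-resource-record-sets --hosted-zone-id <your-zone-id> --change-batch file://batch.json`, where the hosted zone ID is your own.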
Set up routing and TLS for deploying workloads to your cluster
Follow the instructions here to set up DNS and TLS for deploying workloads to your cluster.
Start deploying workloads to your cluster
You can start by going here
Permissions required to create the infrastructure
Coming soon
Setting up TLS in AWS
There are three primary ways to add TLS to the load balancer in AWS:
- Using AWS Certificate Manager (recommended) - With this option, certificates are renewed automatically.
- Using certificate and key files - With this option, pre-created certificates are added to Istio.
- Using cert-manager - This allows you to automatically provision and manage TLS certificates from various issuers (like Let’s Encrypt) with any DNS provider.
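For the ACM option, a quick shape check on the certificate ARN can catch copy-paste mistakes before it is entered in the form. ACM certificates are regional, so the region embedded in the ARN should match the region of the load balancer. This is a rough sketch: the regular expression is an approximation of the ARN layout, not an official validator, and the function name is hypothetical.

```python
import re
from typing import Optional

# Approximate layout: arn:<partition>:acm:<region>:<12-digit account>:certificate/<36-char id>
ACM_ARN = re.compile(
    r"^arn:(?P<partition>aws|aws-cn|aws-us-gov):acm:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):certificate/(?P<id>[0-9a-f-]{36})$"
)


def acm_arn_region(arn: str) -> Optional[str]:
    """Return the region encoded in an ACM certificate ARN, or None if the shape is wrong."""
    match = ACM_ARN.match(arn.strip())
    return match.group("region") if match else None


# Made-up ARN used purely for illustration.
example = "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
cluster_region = "us-east-1"
print(acm_arn_region(example) == cluster_region)
```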