Architecture & Infrastructure Requirements

This guide describes the architecture, access policies, and infrastructure requirements for setting up the compute plane in your AWS account.

AWS Architecture Diagram

Please refer to the "Access Policies" section for details of each access policy.

Access Policies

Each entry lists the access policy, the role it is attached to, and the reason the role exists.

Access Policy: ELBControllerPolicy
Role: <cluster_name>-elb-controller
Reason: Role assumed by the load balancer controller to provision an ELB when a service of type LoadBalancer is created.

Access Policy: KarpenterPolicy, SQSPolicy
Role: <cluster_name>-karpenter
Reason: Role assumed by Karpenter to dynamically provision nodes. Karpenter has an additional role to listen to interruption events coming from SQS so that it can safely handle spot node termination.

Access Policy: EFSPolicy
Role: <cluster_name>-efs
Reason: Role assumed by EFS CSI to provision and attach EFS volumes.

Access Policy: EBSPolicy
Role: <cluster_name>-csi-ebs
Reason: Role assumed by EBS CSI to provision and attach EBS volumes.

Access Policy: RolePolicy, covering:
- ECR
- S3
- SSM
Role: <cluster_name>-platform-iam-role
Reason: Role assumed by TrueFoundry to allow access to:
- ECR
- S3
- SSM

Access Policy: The role attaches these policies:
- AmazonEKSClusterPolicy
- AmazonEKSVPCResourceControllerPolicy
- EncryptionPolicy to create and manage the key for encryption:
  {
    "Statement": [
      {
        "Action": [
          "kms:Encrypt",
          "kms:Decrypt",
          "kms:ListGrants",
          "kms:DescribeKey"
        ],
        "Effect": "Allow",
        "Resource": "arn:aws:kms:<region>:<acc_id>:key/<key_id>"
      }
    ],
    "Version": "2012-10-17"
  }
Role: <cluster_name>-cluster-<random_string>
Reason: This role provides Kubernetes the permissions needed to manage the cluster. This includes the permissions needed to:
- Manage the end-to-end lifecycle of EC2 instances used as EKS nodes
- Assign networking components to EC2 instances
- Perform encryption at rest

Access Policy: The role attaches these policies:
- AmazonEC2ContainerRegistryReadOnlyPolicy
- AmazonEKS_CNI_Policy
- AmazonEKSWorkerNodePolicy
- AmazonSSMManagedInstanceCorePolicy
Role: initial-eks-node-group-<random_string>
Reason: Role assumed by EKS nodes to work with AWS resources for these purposes:
- Pull images from ECR
- Assign IPs to the EC2 instance
- Register itself with the cluster
- Perform disk encryption

Requirements

The following requirements must be met to set up the compute plane in your AWS account.

Each entry lists the requirement, its description, and the reason for the requirement.

Requirement: AWS Account
Description: Billing must be enabled for the AWS account.

Requirement: VPC
Description:
Existing VPC
- Minimum 2 private subnets in different availability zones, each with a minimum CIDR of /24
- Tags should be present as described in the "VPC Tags" section
- NAT gateway for private subnets
- Minimum 1 public subnet for a public load balancer if endpoints are to be exposed to the internet. Auto-assign IP address must be enabled. Minimum CIDR of /28
- DNS support and DNS hostnames must be enabled for your VPC
New VPC
- CIDR should be at least /20 for the VPC
- Minimum 2 availability zones and /24 for private subnets
Reason: This is needed to ensure that around 250 instances and 4096 pods can run in the Kubernetes cluster. If the expected scale is higher, the subnet range should be increased. A NAT Gateway must be connected in the VPC, and the route tables should allow outbound internet access for private subnets through this NAT gateway.

Requirement: Egress access for Docker registries
Description:
1. public.ecr.aws
2. quay.io
3. ghcr.io
4. docker.io/truefoundrycloud
5. docker.io/natsio
6. nvcr.io
7. registry.k8s.io
Reason: This is to download Docker images for TrueFoundry, ArgoCD, NATS, GPU operator, Argo Rollouts, Argo Workflows, Istio, and KEDA.

Requirement: Volumes
Description:
- AWS Elastic Block Store
- AWS Elastic File System
Reason: TrueFoundry creates volumes for notebooks or for caching.

Requirement: DNS with SSL/TLS
Description: A set of endpoints (preferably wildcard) that point to the deployments being made, for example *.internal.example.com and *.external.example.com. An ACM certificate with the chosen domains as SAN is required in the same region.
Reason: When developers deploy their services, they need to access the service endpoints to test them or to call them from other services. This is why DNS with TLS is needed on the compute plane. A wildcard is preferable because developers can then deploy services such as service1.internal.example.com and service2.internal.example.com.
ACM certificate: A certificate is required for the domains listed above. The certificate ARN is passed to the Istio Ingress config. A certificate from another source also works if you create a secret containing the certificate in the Istio namespace.

Requirement: Compute
Description:
CPU
- All Standard (A, C, D, H, I, M, R, T, Z) Spot/On-demand instances must have a minimum of 4 vCPU and 8 GB RAM
GPU
If you are planning to use GPU machines, make sure you have quotas for:
- G and VT Spot/On-demand Instances
- P Spot/On-demand Instance Requests
Inferentia (Optional)
- If you are planning to use Inferentia machines, make sure you have quota for Inferentia Spot/On-demand machines.
Reason: This ensures that TrueFoundry can bring up the instances requested by developers. If the quotas are insufficient, a request must be raised with AWS to increase the instance limits. You can check and increase your quotas at AWS EC2 service quotas.

Requirement: User / ServiceAccount to provision the cluster
Description:
1. STS must be enabled for the user that is used to create the cluster.
2. The user must have the permissions listed in the "Permissions required to create the infrastructure" section.
Reason: See Enabling STS in a region
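
The subnet sizes named in the VPC requirement can be sanity-checked with simple address arithmetic. The sketch below uses Python's standard ipaddress module. The 10.0.0.0/20 range and the helper names are made-up examples, and the five reserved addresses per subnet are an assumption based on AWS reserving the first four and the last address of every subnet.

```python
import ipaddress
from typing import List

# Assumption: AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return how many addresses remain assignable inside one subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def split_vpc(vpc_cidr: str, subnet_prefix: int) -> List[str]:
    """List every subnet of the given prefix length inside the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr, strict=True)
    return [str(subnet) for subnet in vpc.subnets(new_prefix=subnet_prefix)]


if __name__ == "__main__":
    vpc_cidr = "10.0.0.0/20"  # hypothetical VPC range
    print(ipaddress.ip_network(vpc_cidr).num_addresses)  # 4096
    print(len(split_vpc(vpc_cidr, 24)))                  # 16
    print(usable_addresses("10.0.0.0/24"))               # 251
    print(usable_addresses("10.0.1.0/28"))               # 11
```

A /20 range holds 4096 addresses and a /24 subnet leaves 251 assignable addresses, which lines up with the pod and instance figures quoted in the VPC requirement.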

VPC Tags

The private subnets must have the following tags. This is not required if you are using the Onboarding CLI.

"kubernetes.io/cluster/${clusterName}": "shared"
"subnet": "private"
"kubernetes.io/role/internal-elb": "1"

The public subnets must have the following tags. This is not required if you are using the Onboarding CLI.

"kubernetes.io/cluster/${clusterName}": "shared"
"subnet": "public"
"kubernetes.io/role/elb": "1"

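The two tag sets above differ only in the "subnet" value and the load balancer role key, so they can be generated and checked programmatically. The following is a minimal sketch, not part of any TrueFoundry tooling: the function names, the cluster name "demo-cluster", and the sample tags are hypothetical.

```python
from typing import Dict


def required_subnet_tags(cluster_name: str, kind: str) -> Dict[str, str]:
    """Build the tag set listed above for a "private" or "public" subnet."""
    if kind not in ("private", "public"):
        raise ValueError("kind must be 'private' or 'public'")
    # Private subnets carry the internal-elb role tag, public subnets the elb role tag.
    role_key = (
        "kubernetes.io/role/internal-elb"
        if kind == "private"
        else "kubernetes.io/role/elb"
    )
    return {
        f"kubernetes.io/cluster/{cluster_name}": "shared",
        "subnet": kind,
        role_key: "1",
    }


def missing_tags(actual: Dict[str, str], cluster_name: str, kind: str) -> Dict[str, str]:
    """Return the required tags that are absent or carry a different value."""
    required = required_subnet_tags(cluster_name, kind)
    return {key: value for key, value in required.items() if actual.get(key) != value}


if __name__ == "__main__":
    current = {"subnet": "private", "Name": "demo-private-a"}  # hypothetical existing tags
    for key, value in missing_tags(current, "demo-cluster", "private").items():
        print(f"{key}={value}")
```

The tags of an existing subnet could be passed in as `actual` after fetching them with whichever AWS client you already use; the sketch deliberately leaves that lookup out.
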
Permissions required to create the infrastructure

The user running the Onboarding CLI (OCLI) needs the following permissions to create the cluster.

export REGION=""        # e.g. us-east-1
export SHORT_REGION=""  # e.g. usea1
export ACCOUNT_ID=""    # e.g. 123524493244
export NAME=""

The policy below references these variables:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:*",
                "eks:*",
                "elasticfilesystem:*",
                "kms:*",
                "route53:AssociateVPCWithHostedZone",
                "sts:GetCallerIdentity",
                "iam:GetRole"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:*"
            ],
            "Resource": "arn:aws:dynamodb:$REGION:$ACCOUNT_ID:table/$NAME-$REGION-tfy-ocli-table"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:AddRoleToInstanceProfile",
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:GetInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:TagInstanceProfile"
            ],
            "Resource": "arn:aws:iam::$ACCOUNT_ID:instance-profile/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateOpenIDConnectProvider",
                "iam:DeleteOpenIDConnectProvider",
                "iam:GetOpenIDConnectProvider",
                "iam:TagOpenIDConnectProvider"
            ],
            "Resource": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:DeletePolicy",
                "iam:GetPolicy",
                "iam:TagPolicy",
                "iam:GetPolicyVersion",
                "iam:ListPolicyVersions"
            ],
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:policy/tfy-*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_Karpenter_Controller_Policy*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_CNI_Policy*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:*"
            ],
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:role/tfy-*",
                "arn:aws:iam::$ACCOUNT_ID:role/initial-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::tfy-$SHORT_REGION-$NAME-ml*",
                "arn:aws:s3:::tfy-$SHORT_REGION-$NAME-ml*/*",
                "arn:aws:s3:::$NAME-$REGION-tfy-ocli-bucket",
                "arn:aws:s3:::$NAME-$REGION-tfy-ocli-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "events:*"
            ],
            "Resource": "arn:aws:events:$REGION:$ACCOUNT_ID:rule/tfy-$SHORT_REGION-$NAME*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:*"
            ],
            "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:tfy-$SHORT_REGION-$NAME-karpenter"
        }
    ]
}
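
The policy above contains $REGION, $SHORT_REGION, $ACCOUNT_ID and $NAME placeholders that must be filled in before it is valid for IAM. One way to do that, sketched here with Python's standard string.Template and json modules, is to substitute the values and then parse the result to confirm it is well-formed JSON. The template is a shortened excerpt of the policy, and render_policy and all sample values are hypothetical.

```python
import json
from string import Template

# Shortened excerpt of the policy above; the full document renders the same way.
POLICY_TEMPLATE = Template("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:*"],
            "Resource": "arn:aws:dynamodb:$REGION:$ACCOUNT_ID:table/$NAME-$REGION-tfy-ocli-table"
        },
        {
            "Effect": "Allow",
            "Action": ["sqs:*"],
            "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:tfy-$SHORT_REGION-$NAME-karpenter"
        }
    ]
}
""")


def render_policy(region: str, short_region: str, account_id: str, name: str) -> dict:
    """Fill in the placeholders and parse the result as JSON."""
    # substitute() raises KeyError if any placeholder is left without a value.
    rendered = POLICY_TEMPLATE.substitute(
        REGION=region,
        SHORT_REGION=short_region,
        ACCOUNT_ID=account_id,
        NAME=name,
    )
    # json.loads() raises an error if the rendered text is not valid JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # Hypothetical sample values.
    policy = render_policy("us-east-1", "usea1", "123456789012", "demo")
    print(json.dumps(policy, indent=4))
```

Parsing the rendered text catches an empty or malformed substitution early, before the document is submitted to IAM.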