Connect Existing AWS EKS Cluster

Truefoundry can connect an existing AWS EKS cluster to the control plane. To do this, install the tfy-k8s-aws-eks-inframold helm chart on the cluster.

This chart installs all the components needed for the Truefoundry compute plane. You can find the default values of this chart in the chart's Github repository.

🚧

Please make sure to provide all the required values in the values file before installing the helm chart. Also make sure that you are not overriding any components that are already installed in the cluster.

If a component like argocd is already installed on the cluster, you can set its value to false in the values file and then apply the helm chart. We recommend downloading the values file from the Github repository, modifying the values as required, and then applying the helm chart using the commands below.

helm repo add truefoundry https://truefoundry.github.io/infra-charts/
helm install my-tfy-k8s-aws-eks-inframold truefoundry/tfy-k8s-aws-eks-inframold -f values.yaml
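
If you prefer to start from the chart's default values rather than the copy in the Github repository, helm can write them out for you before you run the install command above (this assumes the repo has already been added with the command shown):

helm show values truefoundry/tfy-k8s-aws-eks-inframold > values.yaml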

The documentation for the different fields can be found in the values file itself. Some of the components, such as Karpenter, EBS and EFS, require setting up a few roles and permissions on the AWS account. The instructions for setting up these components can be found below.

Set up Karpenter on the AWS Account

Karpenter is essential for cluster autoscaling and dynamic provisioning of nodes. It lets the cluster scale up and down without any preset nodepool configuration. The following steps enable Karpenter on an AWS account.

  1. Create and bootstrap the node role that Karpenter nodes will use
$ export CLUSTER_NAME=<cluster_name>

$ export AWS_REGION=""

$ echo '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}' > node-trust-policy.json

$ aws iam create-role --role-name karpenter-node-role-${CLUSTER_NAME} \
    --assume-role-policy-document file://node-trust-policy.json

$ aws iam attach-role-policy --role-name karpenter-node-role-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

$ aws iam attach-role-policy --role-name karpenter-node-role-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

$ aws iam attach-role-policy --role-name karpenter-node-role-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

$ aws iam attach-role-policy --role-name karpenter-node-role-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

$ aws iam create-instance-profile \
    --instance-profile-name karpenter-instance-profile-${CLUSTER_NAME}

$ aws iam add-role-to-instance-profile \
    --instance-profile-name karpenter-instance-profile-${CLUSTER_NAME} \
    --role-name karpenter-node-role-${CLUSTER_NAME}
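
Optionally, you can confirm that the node role is attached to the instance profile before moving on (a quick sanity check, not a required step):

$ aws iam get-instance-profile \
    --instance-profile-name karpenter-instance-profile-${CLUSTER_NAME} \
    --query "InstanceProfile.Roles[].RoleName"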
  2. Create the IAM role for the Karpenter controller's service account
$ CLUSTER_ENDPOINT="$(aws eks describe-cluster \
    --name ${CLUSTER_NAME} --query "cluster.endpoint" \
    --output text)"
$ OIDC_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} \
    --query "cluster.identity.oidc.issuer" --output text)"
$ AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' \
    --output text)

$ echo "{
    \"Version\": \"2012-10-17\",
    \"Statement\": [
        {
            \"Effect\": \"Allow\",
            \"Principal\": {
                \"Federated\": \"arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}\"
            },
            \"Action\": \"sts:AssumeRoleWithWebIdentity\",
            \"Condition\": {
                \"StringEquals\": {
                    \"${OIDC_ENDPOINT#*//}:aud\": \"sts.amazonaws.com\",
                    \"${OIDC_ENDPOINT#*//}:sub\": \"system:serviceaccount:karpenter:karpenter\"
                }
            }
        }
    ]
}" > controller-trust-policy.json

$ aws iam create-role --role-name karpenter-controller-role-${CLUSTER_NAME} \
    --assume-role-policy-document file://controller-trust-policy.json

$ echo '{
    "Statement": [
        {
            "Action": [
                "ssm:GetParameter",
                "iam:PassRole",
                "ec2:DescribeImages",
                "ec2:RunInstances",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ec2:DeleteLaunchTemplate",
                "ec2:CreateTags",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:DescribeSpotPriceHistory",
                "pricing:GetProducts"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "Karpenter"
        },
        {
            "Action": "ec2:TerminateInstances",
            "Condition": {
                "StringLike": {
                    "ec2:ResourceTag/Name": "*karpenter*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "ConditionalEC2Termination"
        }
    ],
    "Version": "2012-10-17"
}' > controller-policy.json

$ aws iam put-role-policy --role-name karpenter-controller-role-${CLUSTER_NAME} \
    --policy-name karpenter-controller-policy-${CLUSTER_NAME} \
    --policy-document file://controller-policy.json
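
The ARN of the controller role created above is needed later in the karpenter section of the values file. Assuming the variables from these steps are still set, you can print it with:

$ aws iam get-role --role-name karpenter-controller-role-${CLUSTER_NAME} \
    --query "Role.Arn" --output text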
  3. Tag all the subnets where Karpenter nodes should be created
# This will list all the subnet IDs attached to the cluster. Choose the subnets that Karpenter should create nodes in
$ aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.resourcesVpcConfig.subnetIds"

# Execute the following two commands for each of the chosen subnets (or use the loop shown after this step)
$ aws ec2 create-tags --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared" --resources <subnet_id>

$ aws ec2 create-tags --tags "Key=subnet,Value=private" --resources <subnet_id>
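
If every subnet attached to the cluster should receive Karpenter nodes, a loop along these lines (a sketch; adjust it if only a subset of subnets should be tagged) applies both tags in one go:

$ for subnet_id in $(aws eks describe-cluster --name ${CLUSTER_NAME} \
      --query "cluster.resourcesVpcConfig.subnetIds" --output text); do
    aws ec2 create-tags --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared" --resources ${subnet_id}
    aws ec2 create-tags --tags "Key=subnet,Value=private" --resources ${subnet_id}
  done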
  4. Tag the security group in which the Karpenter nodes are to be created
$ SECURITY_GROUP_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)

$ aws ec2 create-tags --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" --resources ${SECURITY_GROUP_ID}
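
Optionally, verify that the discovery tag was applied to the security group:

$ aws ec2 describe-security-groups --group-ids ${SECURITY_GROUP_ID} \
    --query "SecurityGroups[].Tags" --output table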
  5. Update the aws-auth configmap so that the Karpenter nodes can access the control plane, and add the following section under mapRoles. Note that kubectl will not expand shell variables, so substitute AWS_ACCOUNT_ID and CLUSTER_NAME with their actual values.
$ kubectl edit configmap aws-auth -n kube-system
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::${AWS_ACCOUNT_ID}:role/karpenter-node-role-${CLUSTER_NAME}
  username: system:node:{{EC2PrivateDNSName}}
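
If you have eksctl available, the same mapping can be added without editing the configmap by hand (a sketch, assuming the AWS_ACCOUNT_ID and CLUSTER_NAME variables from the earlier steps are still set):

$ eksctl create iamidentitymapping \
    --cluster ${CLUSTER_NAME} \
    --arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/karpenter-node-role-${CLUSTER_NAME} \
    --username "system:node:{{EC2PrivateDNSName}}" \
    --group system:bootstrappers \
    --group system:nodes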
  6. Enable spot instance creation. If the command below returns an error, the service-linked role already exists and spot instances are already enabled.
$ aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
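
To check whether the role already exists instead of relying on the error message, you can look it up directly (the service-linked role for EC2 Spot is named AWSServiceRoleForEC2Spot):

$ aws iam get-role --role-name AWSServiceRoleForEC2Spot --query "Role.Arn" --output text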

The outputs from the above steps need to be provided in the karpenter section of the values file as shown below:

  karpenter:
    enabled: true
    ## Value of the CLUSTER_ENDPOINT variable from step 2 above
    clusterEndpoint: ""
    ## ARN of the karpenter-controller-role created in step 2 above
    roleArn: ""
    ## Name of the karpenter-instance-profile created in step 1 above
    instanceProfile: ""
    ## Availability zones in which Karpenter can provision nodes
    defaultZones: ""

    gpuProvisioner:
      capacityTypes: ["spot", "on-demand"]
      instanceFamilies: ["p2", "p3", "p4d", "p4de", "p5", "g4dn", "g5"]
      zones: ""

    inferentiaProvisioner:
      capacityTypes: ["spot", "on-demand"]
      instanceFamilies: ["inf1", "inf2"]
      zones: ""

    ## Name of the SQS queue used for interruption handling (e.g. spot interruptions), if configured
    interruptionQueueName: ""

Set up EBS on the AWS Account

The Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create. To set up EBS with your cluster, you need to create an IAM role and provide it in the awsEbsCsiDriver section of the values file.

  1. Substitute the correct values in the script below
export CLUSTER_NAME=""
export AWS_REGION=""
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export OIDC_ENDPOINT=$(aws eks describe-cluster --name ${CLUSTER_NAME} \
    --query "cluster.identity.oidc.issuer" --output text)
  2. Create the following trust policy document
cat > ebs-assume-role-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
          "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:aws-ebs-csi-driver:ebs-csi-controller-sa"
        }
      }
    }
  ]
}
EOF
  3. Create the role and attach the AWS-managed policy using the commands below
# Create the role
aws iam create-role \
  --role-name AmazonEKS_EBS_CSI_DriverRole-${CLUSTER_NAME} \
  --assume-role-policy-document file://"ebs-assume-role-policy.json"
  
# Attach the policy
aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-name AmazonEKS_EBS_CSI_DriverRole-${CLUSTER_NAME}
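
If the role ARN from the create-role output above is no longer on screen, it can be fetched again at any time:

aws iam get-role \
  --role-name AmazonEKS_EBS_CSI_DriverRole-${CLUSTER_NAME} \
  --query "Role.Arn" --output text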

The role ARN from step 3 should be provided in the values file:

  awsEbsCsiDriver:
    enabled: true
    roleArn: <Role ARN from step 3>

Set up EFS on the AWS Account

This section describes how to set up EFS support in your EKS cluster.

  1. Substitute the correct values in the script below
export CLUSTER_NAME=""
export AWS_REGION=""
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export OIDC_ENDPOINT=$(aws eks describe-cluster --name ${CLUSTER_NAME} \
    --query "cluster.identity.oidc.issuer" --output text)
export VPC_ID=$(aws eks describe-cluster \
    --name "${CLUSTER_NAME}" \
    --query "cluster.resourcesVpcConfig.vpcId" \
    --region "${AWS_REGION}" \
    --output text)
export VPC_CIDR_RANGE=$(aws ec2 describe-vpcs \
    --vpc-ids "${VPC_ID}" \
    --query "Vpcs[].CidrBlock" \
    --output text \
    --region "${AWS_REGION}")
export CLUSTER_SUBNET_LIST=$(aws eks describe-cluster \
    --name "${CLUSTER_NAME}" \
    --query 'cluster.resourcesVpcConfig.subnetIds' \
    --output text)
  2. Create the IAM role
cat > efs-assume-role-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:aws-efs-csi-driver:efs-csi-controller-sa"
        }
      }
    }
  ]
}
EOF

export EFS_ROLE_ARN=$(aws iam create-role \
  --role-name "${CLUSTER_NAME}-csi-efs" \
  --assume-role-policy-document file://"efs-assume-role-policy.json" \
  --query 'Role.Arn' --output text)
  3. Attach the policy to the IAM role
aws iam attach-role-policy \
  --policy-arn "arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy" \
  --role-name "${CLUSTER_NAME}-csi-efs"
  4. Create a security group that allows port 2049 (NFS) access from the VPC
# create a security group
SECURITY_GROUP_ID=$(aws ec2 create-security-group \
    --group-name TfyEfsSecurityGroup \
    --description "Truefoundry EFS security group" \
    --vpc-id "${VPC_ID}" \
    --region "${AWS_REGION}" \
    --query "GroupId" \
    --output text)

# Authorize ingress on the security group from the VPC CIDR. This can be narrowed
# down to specific subnet CIDRs if required
aws ec2 authorize-security-group-ingress \
    --group-id $SECURITY_GROUP_ID \
    --protocol tcp \
    --port 2049 \
    --region "${AWS_REGION}" \
    --cidr "${VPC_CIDR_RANGE}"
  5. Create the EFS file system and create mount targets in the cluster subnets
FILE_SYSTEM_ID=$(aws efs create-file-system \
    --region "${AWS_REGION}" \
    --performance-mode generalPurpose \
    --encrypted \
    --throughput-mode elastic \
    --tags Key=Name,Value="${CLUSTER_NAME}-efs" Key=Created-By,Value=Truefoundry Key=cluster-name,Value=$CLUSTER_NAME \
    --query 'FileSystemId' \
    --output text)

for subnet_id in ${CLUSTER_SUBNET_LIST[@]}; do
  aws efs create-mount-target \
    --file-system-id "${FILE_SYSTEM_ID}" \
    --subnet-id "${subnet_id}" \
    --security-groups "${SECURITY_GROUP_ID}" \
    --region "${AWS_REGION}"
done
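
Once the loop completes, you can (optionally) verify that a mount target exists in every subnet:

aws efs describe-mount-targets \
  --file-system-id "${FILE_SYSTEM_ID}" \
  --region "${AWS_REGION}" \
  --query "MountTargets[].{Subnet:SubnetId,State:LifeCycleState}" \
  --output table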