Creating an EKS cluster using onboarding-cli

This document describes how to create a fresh EKS cluster using the onboarding CLI (ocli).

Pre-requisites

  1. Install the AWS CLI, version 2.x.x.
  2. Install git.
  3. Create a local AWS profile that uses an IAM user with admin access to the AWS account where you want to deploy the cluster.
  4. Read the AWS Infrastructure requirements carefully.
  5. Install ocli.
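
The prerequisites above can be checked from a terminal before starting. The following is a minimal sketch, not part of ocli: the function names are illustrative, and it assumes the binaries are available on your PATH as aws, git and ocli.

```shell
#!/bin/sh
# Extract the major version from `aws --version` output,
# e.g. "aws-cli/2.15.30 Python/3.11.8 Linux/5.15.0 exe/x86_64" -> "2".
aws_cli_major() {
  printf '%s\n' "$1" | sed -n 's|^aws-cli/\([0-9][0-9]*\)\..*|\1|p'
}

# Print one line per problem; return non-zero if anything is missing.
check_prereqs() {
  rc=0
  for bin in aws git ocli; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing: $bin"
      rc=1
    fi
  done
  if command -v aws >/dev/null 2>&1; then
    # 2>&1 because AWS CLI v1 prints its version on stderr.
    major=$(aws_cli_major "$(aws --version 2>&1)")
    if [ "$major" != "2" ]; then
      echo "aws CLI major version is '$major', expected 2"
      rc=1
    fi
  fi
  return "$rc"
}
```

Running check_prereqs after sourcing the snippet reports anything that still needs to be installed. It does not check the AWS profile or its permissions.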

Installation

Creating a config file

  1. This section covers the options available for configuring the AWS EKS cluster.

  2. There are two options available for the AWS network:

    1. Existing VPC - Use this when you already have a network for your existing AWS services. The onboarding CLI can deploy the EKS cluster inside that VPC. Read Existing VPC requirements to know more.
    2. New VPC - Select this if you don't have an existing VPC or want to deploy the Truefoundry EKS cluster inside a new one. You will be prompted for the CIDR range of the new VPC; if you are not sure, 10.10.0.0/16 is used as the default. You will also be asked for the private and public subnet CIDRs. Read New VPC requirements to know more.
  3. Run the following command:

    ocli infra init
  4. The screen will be cleared and you will be asked to choose a cloud provider. Select aws, then enter your AWS account ID at the next prompt.

    Truefoundry is a platform that makes it very easy to deploy microservices, ML models training jobs, LLMs on Kubernetes. We will start the process of bootstrapping a Kubernetes cluster. This CLI is useful only if you don't have a Kubernetes cluster. If you already have a cluster, please go to https://docs.truefoundry.com/docs/creating-your-own-kubernetes-cluster
    Let's get started!
    
    1. Cloud Provider
    In which cloud provider you would like to deploy your cluster: :
    >  aws
       azure
       gcp
    aws
    
    2. Account ID
    What is the AWS Account ID where you want to deploy your cluster:

🚧

exec: "aws": executable file not found in $PATH

The above error indicates that the AWS CLI is not present on your local machine. Make sure you have installed the AWS CLI.

🚧

GetLocalAWSProfiles: exit status 2

The above error indicates that the installed AWS CLI version does not match the required version. ocli requires AWS CLI version 2.x.x.

❗️

Error: initCmd: aws.InitAws(): InitAWS: Error getting AWS profiles: inputAWSProfile: No profile found, atleast one profile must exist

This error occurs when the CLI cannot find any AWS profile on your local machine.

  1. Select the right profile from the dropdown list of all the profiles present on your local machine. Use the up and down arrow keys to navigate, and / to search the list by keyword. If you used aws configure to set up credentials, your profile will appear as default.

    3. AWS profiles
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    3(A). Which AWS profile you want to use ?
  2. Enter the name for your EKS cluster. A prefix of tfy and the region in short form will be added to the names of all resources created for the cluster, so you can avoid adding tfy to the cluster name itself.

    4. AWS cluster name
    What is the cluster name that you want for your cluster (final name of your cluster will tfy-<SHORT_REGION>-<NAME>)
  3. Select the region from the dropdown list. Use the up and down arrow keys to navigate, and / to search by keyword.

    5. Region
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    5(A). In which region you want to deploy your cluster:
      eu-west-1
      me-central-1
      eu-central-2
      ap-northeast-3
    ↓ us-east-2
  4. Enter the number of availability zones and select the zones. The default (recommended) value is 3 and the minimum is 2. In the example below, 2 is entered as the count, so two zones are selected from the three available.

    5(B). Avilability Zones
    Enter the number of availability zones (Default 3: 2 <= range <=3):  2
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    Select the availabiltity zone 1: 
      eu-west-1a
      eu-west-1b
      eu-west-1c
    
    # After selecting eu-west-1a
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    Select the availabiltity zone 2: 
      eu-west-1b
      eu-west-1c

Existing VPC

  1. Select existing when you want to deploy the cluster in an existing VPC, then enter the VPC ID. Make sure the subnets have enough IP addresses; ideally each subnet should have a prefix of /20 or shorter (a /20 block holds 4,096 addresses).

    6. Network and VPC
    Do you want to create a new network or reuse an existing network: existing
  2. Select the VPC ID from the drop-down list.

    6(A). VPC ID
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    6(A). What is your existing VPC ID ??
      vpc-xxxxxxxxxxxxxxxxx
      vpc-xxxxxxxxxxxxxxxxx
  3. Select the private subnet IDs. The number of private subnet IDs must equal the number of availability zones entered earlier.

    Private subnets must be equal to the no of availability zones: 2. Skipping input for no of private subnets ...
    6(B). Private Subnet IDs
    Use the arrow keys to navigate: ↓ ↑ → ←  and / toggles search
    Select the availabiltity zone 1: 
      subnet-xxxxxxxxxxxxxxxxx
      subnet-xxxxxxxxxxxxxxxxx
      subnet-xxxxxxxxxxxxxxxxx
      subnet-xxxxxxxxxxxxxxxxx
    ↓ subnet-xxxxxxxxxxxxxxxxx
  4. The number of public subnets can be different from the number of availability zones. You can set it to zero if you don't want the EKS load balancer to be created in a public subnet. Here we select 1 because we want to deploy a load balancer to host our external endpoints.

    6(C). Public Subnet IDs
    Enter the no of public subnets you want (Default: 2): 1
    
    
  5. Next enter the subnet IDs of both private and public subnets

    6(A). VPC ID
    What is you existing VPC ID: vpc-029827189eaa2c22e
    vpc-029827189eaa2c22e
    Below we will ask you to enter the subnet ID details for your existing VPC. We need total of 2 subnets, private and public each
    6(B). Private Subnet IDs
    
    Enter the ID private subnet 1: subnet-0be5bd498c2869c67
    Enter the ID private subnet 2: subnet-0321f13d89fce5bdf
    "subnet-0be5bd498c2869c67" "subnet-0321f13d89fce5bdf" 
    
    6(B). Public Subnet IDs
    
    Enter the ID of public subnet 1: subnet-0da043d78612040f3
    Enter the ID of public subnet 2: subnet-0cc42609184649379
    "subnet-0da043d78612040f3" "subnet-0cc42609184649379" 
  6. The config file will look something like this:

    aws:
      account:
        id: "xxxxxxxxxxxx"
      cluster:
        name: newcl
        public_access:
          cidrs:
            - 0.0.0.0/0
          enabled: true
        version: "1.28"
      iam_role:
        assume_role_arns:
          - arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps
        ecr:
          enabled: true
        enabled: true
        role_enable_override: false
        role_override_name: ""
        s3:
          bucket_enable_override: false
          bucket_override_name: ""
          enabled: true
        ssm:
          enabled: true
      network:
        existing: true
        private_subnets_cidrs: []
        private_subnets_ids:
          - subnet-xxxxxxxxxxxx
          - subnet-xxxxxxxxxxxx
          - subnet-xxxxxxxxxxxx
        public_subnets_cidrs: []
        public_subnets_ids:
          - subnet-xxxxxxxxxxxx
          - subnet-xxxxxxxxxxxx
        vpc_cidr: ""
        vpc_id: vpc-xxxxxxxxxxxx
      profile:
        name: administrator-devtest
      region:
        availability_zones:
          - us-east-1a
          - us-east-1b
          - us-east-1c
        name: us-east-1
      tags: {}
    azure: null
    binaries:
      terraform:
        binary_path: null
      terragrunt:
        binary_path: null
    gcp: null
    provider: aws
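
Before picking subnets in an existing VPC, it can help to see how large each subnet is and how many addresses are still free. The sketch below assumes the AWS CLI is configured for the target account; the function names are illustrative.

```shell
#!/bin/sh
# Number of addresses in an IPv4 CIDR block, derived from its prefix length.
# AWS reserves 5 addresses in every subnet, so slightly fewer are usable.
cidr_size() {
  prefix=${1#*/}
  echo $(( 1 << (32 - prefix) ))
}

# Show each subnet of a VPC with its zone, CIDR and free address count.
# Usage: list_vpc_subnets <profile> <region> <vpc-id>
list_vpc_subnets() {
  aws ec2 describe-subnets --profile "$1" --region "$2" \
    --filters "Name=vpc-id,Values=$3" \
    --query 'Subnets[].[SubnetId,AvailabilityZone,CidrBlock,AvailableIpAddressCount]' \
    --output table
}
```

The zone column makes it easier to pick one private subnet per availability zone selected earlier.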

New VPC (Recommended)

  1. Select new when you want to deploy the cluster in a new VPC, then enter the CIDR range you want. If you press enter, 10.10.0.0/16 is selected as the default and the subnet CIDRs are chosen automatically.

  2. If you choose a different CIDR range for your VPC, you have to enter the subnet CIDRs explicitly.

    6(A). VPC CIDR
    What should be the CIDR for your new VPC (Default: 10.10.0.0/16. Chose a range between /8 and /24): 10.20.0.0/16
    10.20.0.0/16
    Below we will ask you to enter the subnet CIDR details for your new VPC. We need to create total of 3 subnets for each availability zones
    
    6(B). Private Subnet CIDRS
    
    Enter the CIDR of private subnet 1: 10.20.0.0/20
    Enter the CIDR of private subnet 2: 10.20.16.0/20
    Enter the CIDR of private subnet 3: 10.20.32.0/20
    "10.20.0.0/20" "10.20.16.0/20" "10.20.32.0/20" 
    6(C). Public Subnet CIDRS
    
    Enter the CIDR of public subnet 1: 10.20.128.0/20
    Enter the CIDR of public subnet 2: 10.20.144.0/20
    Enter the CIDR of public subnet 3: 10.20.160.0/20
  3. For a new VPC, the config file will look something like this:

    aws:
      account:
        id: "xxxxxxxxxxxx"
      cluster:
        name: clusterxyz
        public_access:
          cidrs:
            - 0.0.0.0/0
          enabled: true
        version: "1.28"
      iam_role:
        assume_role_arns:
          - arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps
        ecr:
          enabled: true
        enabled: true
        role_enable_override: false
        role_override_name: ""
        s3:
          bucket_enable_override: false
          bucket_override_name: ""
          enabled: true
        ssm:
          enabled: true
      network:
        existing: false
        private_subnets_cidrs:
          - 10.10.0.0/20
          - 10.10.16.0/20
          - 10.10.32.0/20
        private_subnets_ids: []
        public_subnets_cidrs:
          - 10.10.176.0/20
          - 10.10.192.0/20
          - 10.10.208.0/20
        public_subnets_ids: []
        vpc_cidr: 10.10.0.0/16
        vpc_id: ""
      profile:
        name: admin
      region:
        availability_zones:
          - us-east-1a
          - us-east-1b
          - us-east-1c
        name: us-east-1
      tags: {}
    azure: null
    binaries:
      terraform:
        binary_path: null
      terragrunt:
        binary_path: null
    gcp: null
    provider: aws
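
When you enter subnet CIDRs by hand, each /20 subnet inside a /16 VPC starts 16 steps further along in the third octet, which is why the examples above use 0, 16, 32 for private subnets and 128, 144, 160 for public ones. The helper below is a small sketch of that arithmetic; it is illustrative only and assumes a /16 VPC CIDR split into /20 subnets.

```shell
#!/bin/sh
# Print the n-th /20 block (0-based) inside a /16 VPC CIDR.
# A /16 contains 16 such blocks, so n must be between 0 and 15.
nth_slash20() {
  vpc_cidr=$1
  n=$2
  network=${vpc_cidr%/*}   # e.g. "10.20.0.0"
  first=${network%%.*}     # first octet, e.g. "10"
  rest=${network#*.}       # e.g. "20.0.0"
  second=${rest%%.*}       # second octet, e.g. "20"
  echo "$first.$second.$(( n * 16 )).0/20"
}
```

For example, nth_slash20 10.20.0.0/16 2 gives the third private subnet from the transcript, and indexes 8 to 10 give the public ones.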

IAM Role section (new)

Following security best practices, we no longer create an IAM user with access to your account's ECR, S3 bucket and SSM. It is replaced by an IAM role that uses password-less cross-account access.

We now create an IAM role in your account that Truefoundry's production IAM role arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps is allowed to assume. If you are using Truefoundry's control plane, you can leave this IAM role ARN as it is. If you are using your own control plane, add the control plane IAM roles in the aws.iam_role.assume_role_arns section.

To override the name of the IAM role that will be created in your account, change the following settings in your tfy-config.yaml file:

    role_enable_override: true
    role_override_name: "<your-preferred-role-name>"

You can disable the role's access to ECR, SSM or the S3 bucket. If S3 is disabled, the bucket is not created either. The following changes disable ECR and the S3 bucket but keep SSM enabled:

    iam_role:
        assume_role_arns:
            - arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps
        ecr:
            enabled: false
        enabled: true
        role_enable_override: false
        role_override_name: ""
        s3:
            bucket_enable_override: false
            bucket_override_name: ""
            enabled: false
        ssm:
            enabled: true

The S3 bucket is given a default name, which can be overridden using the following parameters in the aws.iam_role.s3 section:

    s3:
      bucket_enable_override: true
      bucket_override_name: "<your-preferred-bucket-name>"
      enabled: true

You can disable creation of the IAM role entirely by setting aws.iam_role.enabled to false.

Applying tags

Customers commonly need to tag all the resources created by TrueFoundry. The tags: {} section is provided for this; the key-value pairs in it are applied to all the resources deployed by TrueFoundry. An example of this section:

    tags:
      Owner: TrueFoundry
      Email: [email protected]
      Purpose: LLM
      Risk: Low
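
Once the cluster has been created, one way to spot-check that the tags were applied is to query the AWS Resource Groups Tagging API. The sketch below is not part of ocli and its function names are illustrative; note that this API does not cover every resource type, so an absent resource is not proof that it is untagged.

```shell
#!/bin/sh
# Turn "Owner=TrueFoundry" into the "Key=...,Values=..." form that
# `aws resourcegroupstaggingapi get-resources --tag-filters` expects.
tag_filter() {
  echo "Key=${1%%=*},Values=${1#*=}"
}

# List the ARNs of resources in a region that carry the given tag.
# Usage: list_tagged_resources <profile> <region> <Key=Value>
list_tagged_resources() {
  aws resourcegroupstaggingapi get-resources --profile "$1" --region "$2" \
    --tag-filters "$(tag_filter "$3")" \
    --query 'ResourceTagMappingList[].ResourceARN' --output text | tr '\t' '\n'
}
```

With the example tags above, the third argument would be Owner=TrueFoundry.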

Running the config file

  1. Apply the config file by running:
    ocli infra create --file tfy-config.yaml

Post cluster-creation steps

Saving the output

The above process generates output that is needed later for deploying some applications. Save it to a file:

    ocli infra output --file tfy-config.yaml > output.txt

Downloading the kubeconfig file

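  1. The CLI downloads kubectl by default if it is not present. We need to use the AWS CLI to download the kubeconfig file. First set the following variables:
    export AWS_REGION=""      # region selected during ocli infra init
    export CLUSTER_NAME=""    # final cluster name, tfy-<SHORT_REGION>-<NAME>
    export AWS_PROFILE=""     # AWS profile selected during ocli infra init
  2. Download the kubeconfig file
    aws eks --region "$AWS_REGION" update-kubeconfig --name "$CLUSTER_NAME" --profile "$AWS_PROFILE"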
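
After the kubeconfig is downloaded, a quick check confirms that kubectl points at the new cluster. The following is a sketch with illustrative function names; it assumes kubectl is on your PATH and that the three variables above are exported.

```shell
#!/bin/sh
# Return non-zero (and name the culprit) if any listed variable is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Show the clusters in the region, the active kubectl context, and the nodes.
verify_cluster_access() {
  require_vars AWS_REGION CLUSTER_NAME AWS_PROFILE || return 1
  aws eks list-clusters --region "$AWS_REGION" --profile "$AWS_PROFILE" \
    --query 'clusters' --output text
  kubectl config current-context
  kubectl get nodes
}
```

The list-clusters output is also a convenient way to find the final cluster name if you are unsure of the short region form used in the prefix.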

Connecting the cluster to the platform

Once your cluster is created, a second step is needed to install the truefoundry-agent, which connects the cluster to the control plane.

  1. If you haven't registered for TrueFoundry yet, follow the doc to register your company.
  2. Once you have logged in, head over to the Bring Your Own Cluster section to add the cluster details.
  3. Copy the bash script given in the tab and execute it. This part will be replaced by an ocli script soon.