Requirements
Requirements for TrueFoundry installation on AWS
The following is the list of requirements for setting up the compute plane in your AWS account.
AWS Infra Requirements
New VPC + New Cluster
These are the requirements for a fresh TrueFoundry installation. If you are reusing an existing network or cluster, also refer to the sections further below.
Requirements | Description | Reason for Requirement |
---|---|---|
AWS Account | Billing must be enabled for the AWS account. | |
VPC | | This is needed to ensure that around 250 instances and 4096 pods can run in the Kubernetes cluster. If you expect a higher scale, increase the subnet range. A NAT Gateway must be attached to the VPC, and the route tables must allow outbound internet access from the private subnets through this NAT Gateway. |
Egress access for Docker registry | 1. public.ecr.aws | This is needed to download the Docker images for TrueFoundry, ArgoCD, NATS, GPU operator, Argo Rollouts, Argo Workflows, Istio, and KEDA. |
DNS with SSL/TLS | A set of endpoints (preferably wildcard) that point to the deployments being made, for example *.internal.example.com, *.external.example.com. An ACM certificate with the chosen domains as SANs is required in the same region. | When developers deploy their services, they need to reach the service endpoints to test them or to call them from other services. This is why DNS with TLS is required on the compute plane. A wildcard is preferable because developers can then deploy services such as service1.internal.example.com and service2.internal.example.com. |
ACM Certificate | We need to have a certificate for the domains listed above. The certificate ARN will be passed to the Istio Ingress config. | If you have a certificate from some other source, that can also work by creating a secret with the certificate in |
Cloud Quotas | GPU | This ensures that TrueFoundry can bring up the instances requested by developers. If the account does not have sufficient quota, a request must be raised with AWS to increase the instance limits. You can check and increase your quotas at AWS EC2 service quotas. |
User / ServiceAccount to provision the cluster | 1. |
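The VPC sizing figures in the table (around 250 instances and 4096 pods) can be sanity-checked against a candidate set of subnet CIDRs. The following is a minimal Python sketch, not part of the TrueFoundry tooling. It assumes that every instance and every pod consumes one VPC IP address (the default behaviour of the AWS VPC CNI without prefix delegation) and that AWS reserves five addresses in each subnet. The function names and the example CIDRs are hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidrs):
    """Return the number of assignable IP addresses across the given subnet CIDRs."""
    total = 0
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=True)
        total += max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)
    return total


def fits(cidrs, instances=250, pods=4096):
    """Check capacity, assuming one VPC IP per instance and one per pod."""
    return usable_ips(cidrs) >= instances + pods


if __name__ == "__main__":
    # Hypothetical private subnets, one per availability zone.
    subnets = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
    print(usable_ips(subnets), fits(subnets))
```

Under these assumptions a single /20 subnet falls slightly short of the target, while three /20 subnets leave comfortable headroom. If your cluster uses prefix delegation or custom networking, the arithmetic changes and this check does not apply directly.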
Existing network
Requirements | Description | Reason for Requirement |
---|---|---|
VPC | | This is needed to ensure that around 250 instances and 4096 pods can run in the Kubernetes cluster. If you expect a higher scale, increase the subnet range. A NAT Gateway must be attached to the VPC, and the route tables must allow outbound internet access from the private subnets through this NAT Gateway. |
VPC Tags
Your subnets must have the following tags for the TrueFoundry Terraform code to work with them. You can skip this step if you are creating a new network, in which case the tags are created automatically.
Private Subnets
"kubernetes.io/cluster/${clusterName}": "shared"
"subnet": "private"
"kubernetes.io/role/internal-elb": "1"
Public Subnets
"kubernetes.io/cluster/${clusterName}": "shared"
"subnet": "public"
"kubernetes.io/role/elb": "1"
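The tag requirements above can be expressed as a small check. The following is a minimal Python sketch; `required_subnet_tags` and `missing_tags` are hypothetical helper names, and the cluster name used in the example is illustrative. The code only compares dictionaries and does not call any AWS API, so the actual tags would have to be fetched separately.

```python
def required_subnet_tags(cluster_name, kind):
    """Build the tags listed above for a 'private' or 'public' subnet."""
    if kind not in ("private", "public"):
        raise ValueError("kind must be 'private' or 'public'")
    role = "internal-elb" if kind == "private" else "elb"
    return {
        f"kubernetes.io/cluster/{cluster_name}": "shared",
        "subnet": kind,
        f"kubernetes.io/role/{role}": "1",
    }


def missing_tags(actual_tags, cluster_name, kind):
    """Return the required tags that are absent or carry a different value."""
    required = required_subnet_tags(cluster_name, kind)
    return {key: value for key, value in required.items() if actual_tags.get(key) != value}


if __name__ == "__main__":
    # Hypothetical tags read from an existing private subnet.
    existing = {"kubernetes.io/cluster/my-cluster": "shared", "subnet": "private"}
    print(missing_tags(existing, "my-cluster", "private"))
```

An empty result means the subnet already carries every tag from the list for its type.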
Existing cluster
Requirements | Description | Reason for Requirement |
---|---|---|
Compute | CPU | |
Permissions required to create the infrastructure
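The following policy must be attached to the IAM user that runs Terraform. The placeholders $REGION, $SHORT_REGION, $ACCOUNT_ID, and $NAME in the policy correspond to the variables below.
export REGION=""        # e.g. us-east-1
export SHORT_REGION=""  # e.g. usea1
export ACCOUNT_ID=""    # e.g. 123524493244
export NAME=""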
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "eks:*",
        "elasticfilesystem:*",
        "kms:*",
        "route53:AssociateVPCWithHostedZone",
        "sts:GetCallerIdentity",
        "iam:GetRole"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:*"
      ],
      "Resource": "arn:aws:dynamodb:$REGION:$ACCOUNT_ID:table/$NAME-$REGION-tfy-ocli-table"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:AddRoleToInstanceProfile",
        "iam:CreateInstanceProfile",
        "iam:DeleteInstanceProfile",
        "iam:GetInstanceProfile",
        "iam:RemoveRoleFromInstanceProfile",
        "iam:TagInstanceProfile"
      ],
      "Resource": "arn:aws:iam::$ACCOUNT_ID:instance-profile/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateOpenIDConnectProvider",
        "iam:DeleteOpenIDConnectProvider",
        "iam:GetOpenIDConnectProvider",
        "iam:TagOpenIDConnectProvider"
      ],
      "Resource": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreatePolicy",
        "iam:DeletePolicy",
        "iam:GetPolicy",
        "iam:TagPolicy",
        "iam:GetPolicyVersion",
        "iam:ListPolicyVersions"
      ],
      "Resource": [
        "arn:aws:iam::$ACCOUNT_ID:policy/tfy-*",
        "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_Karpenter_Controller_Policy*",
        "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_CNI_Policy*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:*"
      ],
      "Resource": [
        "arn:aws:iam::$ACCOUNT_ID:role/tfy-*",
        "arn:aws:iam::$ACCOUNT_ID:role/initial-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::tfy-$SHORT_REGION-$NAME-ml*",
        "arn:aws:s3:::tfy-$SHORT_REGION-$NAME-ml*/*",
        "arn:aws:s3:::$NAME-$REGION-tfy-ocli-bucket",
        "arn:aws:s3:::$NAME-$REGION-tfy-ocli-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:*"
      ],
      "Resource": "arn:aws:events:$REGION:$ACCOUNT_ID:rule/tfy-$SHORT_REGION-$NAME*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:*"
      ],
      "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:tfy-$SHORT_REGION-$NAME-karpenter"
    }
  ]
}
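One way to fill in the placeholders is to treat the policy as a template. The following is a minimal Python sketch using the standard library's `string.Template`, whose `$NAME` syntax happens to match the placeholders in the policy. The function name `render_policy`, the one-statement template, and the example values (including the name `demo`) are assumptions made for illustration; in practice the full policy text would be read from a file of your choosing.

```python
import json
from string import Template


def render_policy(template_text, region, short_region, account_id, name):
    """Substitute the four placeholders and confirm the result parses as JSON."""
    rendered = Template(template_text).substitute(
        REGION=region,
        SHORT_REGION=short_region,
        ACCOUNT_ID=account_id,
        NAME=name,
    )
    # json.loads raises ValueError if the substituted text is not valid JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # Shortened, hypothetical template containing one resource from the policy.
    snippet = '{"Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:tfy-$SHORT_REGION-$NAME-karpenter"}'
    policy = render_policy(snippet, "us-east-1", "usea1", "123524493244", "demo")
    print(json.dumps(policy, indent=2))
```

`Template.substitute` raises `KeyError` when the text contains a placeholder that was not supplied, which helps catch a misspelled variable before the policy is attached to a user.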