GCP Infra Requirements
For clusters that are to be onboarded, certain requirements must be fulfilled. These requirements span network, CPU, GPU, and access.
Control plane requirements
The requirements below apply to both workload and control plane clusters. However, requirements that assume an existing GKE cluster do not apply to the control plane, since control plane clusters require setup by the Truefoundry team.
Common requirements
- If the cluster is to be set up at the client's end using terraform/terragrunt, the following must be installed (see the version-check sketch after this list):
  - kubectl >= 1.25
  - helm >= 3.10
  - terraform == 1.4.x
  - terragrunt == 0.46.x
- If a control plane installation is involved, setup from the Truefoundry team is additionally required.
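The installed tool versions can be verified locally with something along these lines (a quick sketch; these are the standard version flags for each tool):

```bash
kubectl version --client   # expect >= 1.25
helm version --short       # expect >= 3.10
terraform version          # expect 1.4.x
terragrunt --version       # expect 0.46.x
```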
Required Google APIs
- Compute Engine API - must be enabled for virtual machines
- Kubernetes Engine API - must be enabled for Kubernetes (GKE) clusters
- Cloud Storage API - must be enabled for GCP blob storage (buckets)
- Artifact Registry API - must be enabled for the docker registry and image builds
- Secret Manager API - must be enabled to support secret management
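As a convenience, all of the above APIs can be enabled in one go with gcloud (a sketch; PROJECT_ID is a placeholder for your project):

```bash
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com \
  secretmanager.googleapis.com \
  --project PROJECT_ID
```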
Network requirements
For an existing VPC network, enough IP addresses must be free:
- GKE by default assigns a /24 range (256 IP addresses) to each node, so the pod address range must be large enough to spin up multiple nodes; for example, a /22 range can support a maximum of 4 nodes.
- For an existing GKE cluster, all ports must be open from the master to the nodes. This is required for Istio (selected ports) and a few other applications.
- The following requirements must be met for DNS:
  - Truefoundry requires a minimum of two endpoints for a control plane cluster and a minimum of one endpoint for a workload cluster.
  - Control plane
    - One endpoint for hosting the control plane UI. It can be something like truefoundry.<orgname>.ai or platform.<org-name>.com.
    - One endpoint for hosting the ML workloads. It should be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com.
  - Workload (only)
    - One endpoint for hosting the ML workloads. It can be a wildcard at the subdomain or sub-subdomain level, e.g. *.ml.<org-name>.com. The endpoint is not required to be a wildcard; endpoint-based routing is supported along with subdomain-based routing.
- If Istio is already deployed, make sure the host field in the Istio Gateway is set to the endpoint on which you want your workloads exposed (publicly). This endpoint must then be passed to the workload cluster from the UI.
- If Istio is not deployed, a load balancer address will come up when Istio gets installed in the cluster during onboarding. The value of the load balancer's IP must be mapped as an A record to the endpoint where your workloads will be hosted (publicly); see the example command after this list.
- TLS/SSL termination can happen in two ways in GCP:
  - Using cert-manager - cert-manager can be installed in the cluster; it then talks to your DNS provider to complete DNS challenges and create secrets in the cluster, which can then be used by the Istio Gateway (see the Gateway sketch after this list).
  - Certificate and key-pair file - a raw certificate file can also be used. For this, a secret containing the TLS certificate must be created; refer here for more details. The secret can then be passed in the Istio Gateway configuration.
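For the A record mapping mentioned above, if the DNS zone happens to be hosted in Google Cloud DNS (an assumption; any DNS provider works), the command looks roughly like this, with ZONE_NAME and LB_IP as placeholders:

```bash
# Map the wildcard workload endpoint to the Istio load balancer IP.
gcloud dns record-sets create "*.ml.<org-name>.com." \
  --zone="ZONE_NAME" \
  --type="A" \
  --ttl=300 \
  --rrdatas="LB_IP"
```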
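And a minimal sketch of the Istio Gateway configuration referenced above, assuming the wildcard workload domain *.ml.<org-name>.com and a hypothetical TLS secret named tfy-tls-cert (created by cert-manager or from a raw certificate):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: workload-gateway          # hypothetical name
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway         # targets the default Istio ingress gateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: tfy-tls-cert   # TLS secret from cert-manager or a raw certificate
      hosts:
        - "*.ml.<org-name>.com"        # the endpoint your workloads expose publicly
```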
Compute requirements
- Compute requirements refer to the amount of compute (CPU/GPU/memory) that is available for use in your region.
- Make sure to set up node auto-provisioning (NAP) to allow GPU and CPU nodes. Below is a sample NAP config file that can be used; make sure to change the region.
nap.yaml
- This is just an example; the resources can be configured accordingly.

```yaml
resourceLimits:
  - resourceType: 'cpu'
    minimum: 0
    maximum: 1000
  - resourceType: 'memory'
    minimum: 0
    maximum: 10000
  - resourceType: 'nvidia-tesla-k80'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-p100'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-p4'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-v100'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-t4'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-a100'
    minimum: 0
    maximum: 8
  - resourceType: 'nvidia-a100-80gb'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-l4'
    minimum: 0
    maximum: 4
autoprovisioningLocations:
  - REGION-a
  - REGION-c
  - REGION-f
management:
  autoRepair: true
  autoUpgrade: true
shieldedInstanceConfig:
  enableSecureBoot: true
  enableIntegrityMonitoring: true
diskSizeGb: 300
diskType: pd-balanced
```
- Execute the command below to update the existing cluster:

```bash
gcloud container clusters update CLUSTER_NAME \
  --enable-autoprovisioning \
  --autoprovisioning-config-file nap.yaml
```
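To confirm that node auto-provisioning is enabled after the update, a describe call along these lines can be used (the format path is an assumption about the cluster resource schema):

```bash
gcloud container clusters describe CLUSTER_NAME \
  --format="value(autoscaling.enableNodeAutoprovisioning)"   # should print True
```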