GCP Infra Requirements

For clusters that are to be onboarded there are certain requirements which must be fulfilled. These requirements vary from network, CPU, GPUs and access.

Common requirements

Following things must be installed. However if you are using Onboarding CLI then these will be installed by default -
- kubectl >=1.25+
- helm >=3.10

Required Google APIs

Following APIs are to be enabled in your google project. Check How to enable APIs in a Google Cloud project. If you are using Onboarding CLI to create the GKE cluster then these APIs will be enabled automatically.

Compute Engine API - This API must be enabled for Virtual Machines
Kubernetes Engine API - This API must be enabled for Kubernetes clusters
Storage Engine API - This API must be enabled for GCP Blob storage - Buckets
Artifact Registry API - This API must be enabled for docker registry and image builds
Secrets Manager API - This API must be enabled to support Secret management

Network requirements

You can either use an existing network or a new network (recommended) to deploy your cluster in GCP.

For existing VPC network
- Enough IPs must be free in your VPC. Check how to create a cluster in an existing network
- Your existing subnet should contain a secondary range for pod and services separetely. CIDR of these ranges should atleast be less than /20 and less than /24 respectively. Check GCP Subnets to know more on this.
- A cloud router with a cloud NAT gateway must be created in the VPC so as to allow internet connection.
- A VPC by default has a route for internet access. If the route is not present, then a new route must be created to allow internet access through the NAT gateway.
For new VPC network
- You will be asked for the input of CIDR for you new subnet. Make sure to chose a range which allow /16 for your subnet. Default value is 10.10.0.0/16. Check how to create a cluster in a new network
- You will be asked for the input secondary range pods and services . Make sure to chose a range of /16 for pods and /20 for services at the least. Default value for pods is 10.244.0.0/16 and for services is 10.255.0.0/16
GKE by defaults assign 256 IP addresses to a node, so there must be enough amount of IP address available to spin multiple nodes. A CIDR of /24 can have maximum of 4 nodes so you need to have a bigger subnets and secondary ranges to allow your cluster to autoscale to good amount.
For an existing GKE cluster all ports must be connecting from master to nodes. This is required for Istio(selected ports) and few other applications.
Following requirements must be met for DNS
- There is minimum one endpoint for workload cluster required for hosting the ML workloads. It can be wildcard at sub domain or sub-sub domain level. For e.g. (*.ml.<org-name>.com, *.tfy.<org-name>.com or *.tfy-apps.<org-name>.com etc.). The endpoint is not required to be wildcard only, we support endpoint based routing along with sub domain based routing.
- When Istio will be deployed then a load balancer IP address will come up which needs to be mapped to the domain for your workloads in your DNS provider.
- TLS/SSL termination can happen in two ways in GCP
  - Using cert-manager - cert-manager can be installed in the cluster which can then talk to your DNS provider to create DNS challenges in order to create secrets in the cluster which can be then be used by Isito Gateway
  - Certificate and key-pair file - Raw certificate file can also be used. For this a secret must be created which contains the TLS certificate. Refer here for more details on this. Secret then can be passed in Istio Gateway configuration.

Compute requirements

Compute requirements refers to the amount of compute (CPU/GPU/memory) that is available for use in your region.
Make sure you have enough quotas for your compute requirements. Understand How quotas work in GCP and create proper quota request so that your applications can run and autoscale smoothly.

Make sure to setup the node-auto provisioning to allow GPU and CPU nodes. Below is a sample of NAP config file that can be used. If you are using Onboarding CLI then this will be created automatically for you. Make sure to change the region

nap.yaml - These is just an example but can be used to configure the resources accordingly.

resourceLimits:
  - resourceType: 'cpu'
    minimum: 0
    maximum: 1000
  - resourceType: 'memory'
    minimum: 0
    maximum: 10000
  - resourceType: 'nvidia-tesla-k80'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-p100'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-p4'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-v100'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-t4'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-tesla-a100'
    minimum: 0
    maximum: 8
  - resourceType: 'nvidia-a100-80gb'
    minimum: 0
    maximum: 4
  - resourceType: 'nvidia-l4'
    minimum: 0
    maximum: 4
autoprovisioningLocations:
  - REGION-a
  - REGION-c
  - REGION-f
management:
  autoRepair: true
  autoUpgrade: true
shieldedInstanceConfig:
  enableSecureBoot: true
  enableIntegrityMonitoring: true
diskSizeGb: 300
diskType: pd-balanced

Execute the below command to update the existing cluster

gcloud container clusters update CLUSTER_NAME \
   --enable-autoprovisioning \
   --region REGION \
   --autoprovisioning-config-file nap.yaml

Authentication requirements

To create a Kubernetes cluster and all the required resources. you must have the following criteria accomplished

A valid project with billing enabled
If you are using Onboarding CLI to create the cluster you must have contributor or owner level access in your project.
Service Account key creation should be allowed for the required service accounts. This is necessary for integrations and automation that require key-based authentication.

Getting Started

Train and Deploy Models

Service Deployment

Job Deployment

Workflow Deployment

Async Service Deployment

Volumes

ML Repository

Secret Management

Platform

Deploying On Your Own Cloud

HOW TOs

Common requirements

Required Google APIs

Network requirements

Compute requirements

Authentication requirements

Getting Started

Train and Deploy Models

Service Deployment

Job Deployment

Workflow Deployment

Async Service Deployment

Volumes

ML Repository

Secret Management

Platform

Deploying On Your Own Cloud

HOW TOs

​Common requirements

​Required Google APIs

​Network requirements

​Compute requirements

​Authentication requirements

Common requirements

Required Google APIs

Network requirements

Compute requirements

Authentication requirements