Requirements

Following is the list of requirements to set up compute plane in your GCP project

GCP Infra Requirements

New VPC + New Cluster

Below are the requirements for a new VPC + new cluster.

Requirements	Description	Reason for requirement
VPC	- Minimum subnet CIDR should be /24 - Additional ranges: * Pods: min /20 CIDR * Services: min /24 CIDR - Cloud Router, Cloud NAT - Port 80/443 ingress from internet - Allow all ports ingress inside a subnet - Port 443, 8443, 9443 and 15017 for connection to GKE master control plane	This is needed to ensure around 250 instances and 4096 pods can be run in the Kubernetes cluster. If we expect the scale to be higher, the subnet range should be increased accordingly. Pod and subnet secondary range can be non-routable as well. Cloud Router and NAT are required for egress internet access.
Egress access For Docker Registry	- `public.ecr.aws` - `quay.io` - `ghcr.io` - `tfy.jfrog.io` - `docker.io/natsio` - `nvcr.io` - `registry.k8s.io`	This is to download docker images for Truefoundry, ArgoCD, NATS, GPU operator, ArgoRollouts, ArgoWorkflows, Istio, Keda.
DNS	Domain for service endpoints	Examples: `.internal.example.com`, `.external.example.com`, `tfy.example.com`. Wildcard preferred for developer service deployments
Certificate	Certificate for the domains	Required for terminating TLS traffic to the services. Check here for more details
Compute	- Quotas must be enabled for required CPU and GPU instance types (on-demand and preemptible / spot)	This is to make sure TrueFoundry can bring up the machines as needed. Check compute quotas for more details.
User/ServiceAccount to provision the infrastructure	Permission to run terraform	These are the permissions required by the IAM user in GCP to create the entire compute-plane infrastructure. Check here for more details.

Existing Network

Requirements	Description	Reason for requirement
VPC	- Minimum subnet CIDR should be /24 - Additional ranges: * Pods: min /20 CIDR * Services: min /24 CIDR - Cloud Router, Cloud NAT - Port 80/443 ingress from internet - Allow all ports ingress inside a subnet - Port 443, 6443, 8443, 9443 and 15017 for connection to GKE master control plane	This is needed to ensure around 250 instances and 4096 pods can be run in the Kubernetes cluster. If we expect the scale to be higher, the subnet range should be increased. Pod and subnet secondary range can be non-routable as well. Cloud Router and NAT are required for egress internet access.

Existing Cluster

Requirements	Description	Reason for requirement
GKE Version	GKE version 1.30 or later (recommended)	Recommended for latest security features, better performance and stability.
Node Auto Provisioning (NAP)	- Must be enabled and configured with appropriate resource limits - Recommended minimum limits: * CPU: 1000 cores * Memory: 4000 GB - GPU quotas (if using accelerators)	Required for automatic node provisioning based on workload demands. This ensures efficient resource allocation and scaling for ML workloads without manual intervention.
Workload Identity	- Must be enabled on the cluster - Configured with PROJECT_ID.svc.id.goog workload pool - Node pools must have Workload Identity enabled	Required for secure service account access to GCP resources. Eliminates the need for service account keys and provides fine-grained access control for Kubernetes workloads.

Node Auto Provisioning Configuration

Node Auto Provisioning (NAP) needs to be properly configured with resource limits and defaults. Here’s how to configure NAP:

To enable NAP on your existing GKE cluster:

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 0 \
    --max-cpu 1000 \
    --min-memory 0 \
    --max-memory 10000 \
    --location=REGION

For GPU support, add GPU resource limits:

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --autoprovisioning-resource-limits=nvidia-tesla-t4=0:256 \
    --location=REGION

You can also enable NAP through the Google Cloud Console:

Go to Google Kubernetes Engine > Clusters
Select your cluster and click “Edit”
Under “Features” find “Node Auto-Provisioning” and enable it
Set resource limits for CPU and memory
Click “Save”

For more details, see Google’s official documentation on Node Auto-Provisioning.

Enable Workload Identity

Run the following commands to enable workload identity:

# Enable Workload Identity feature
gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --region=REGION

# Enable Workload Identity on the node pool
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --workload-identity-config=workload-pool=PROJECT_ID.svc.id.goog \
    --region=REGION

Permissions required to create the infrastructure

The IAM user should have the following permissions

Compute Admin
Compute Network Admin
Kubernetes Engine Admin
Security Admin
Service Account Admin
Service Account Token Creator
Service Account User
Storage Admin
Service usage Admin

Common Required APIs

Following APIs are to be enabled in your Google project. Check How to enable APIs in a Google Cloud project.

Compute Engine API - This API must be enabled for Virtual Machines
Kubernetes Engine API - This API must be enabled for Kubernetes clusters
Storage Engine API - This API must be enabled for GCP Blob storage - Buckets
Artifact Registry API - This API must be enabled for docker registry and image builds
Secrets Manager API - This API must be enabled to support Secret management
Service Account key creation - This API must be enabled to support Service Account key creation

Architecture Setting Up DNS And TLS In GCP

On this page

GCP Infra Requirements
New VPC + New Cluster
Existing Network
Existing Cluster
Node Auto Provisioning Configuration
Enable Workload Identity
Permissions required to create the infrastructure
Common Required APIs

Getting Started

Train and Deploy Models

Service Deployment

Job Deployment

Workflow Deployment

Async Service Deployment

Volumes

ML Repository

Secret Management

Access Control

Platform

Deploying On Your Own Cloud

HOW TOs

GCP Infra Requirements

New VPC + New Cluster

Existing Network

Existing Cluster

Node Auto Provisioning Configuration

Enable Workload Identity

Permissions required to create the infrastructure

Common Required APIs

Getting Started

Train and Deploy Models

Service Deployment

Job Deployment

Workflow Deployment

Async Service Deployment

Volumes

ML Repository

Secret Management

Access Control

Platform

Deploying On Your Own Cloud

HOW TOs

​GCP Infra Requirements

​New VPC + New Cluster

​Existing Network

​Existing Cluster

​Node Auto Provisioning Configuration

​Enable Workload Identity

​Permissions required to create the infrastructure

​Common Required APIs

GCP Infra Requirements

New VPC + New Cluster

Existing Network

Existing Cluster

Node Auto Provisioning Configuration

Enable Workload Identity

Permissions required to create the infrastructure

Common Required APIs