Following is the list of requirements to set up the compute plane in your GCP project.

GCP Infra Requirements

New VPC + New Cluster

These are the requirements for a fresh TrueFoundry installation. If you are reusing an existing network or cluster, refer to the sections further below in addition to this one.

Requirement: VPC
Description: Any existing or new VPC with the following should work:
- Minimum subnet CIDR of /24
- Additional secondary ranges:
  * Pods: at least a /20 CIDR
  * Services: at least a /22 CIDR
- Cloud Router and Cloud NAT
- Ports 80/443 open for ingress from the internet
- All ports open for ingress within the subnet
- Ports 443, 8443, 9443 and 15017 open for connectivity to the GKE master control plane
Reason: This is needed to ensure that around 250 instances and 4096 pods can run in the Kubernetes cluster. If you expect higher scale, increase the subnet ranges. Cloud Router and Cloud NAT are required for egress internet access. A sketch of the gcloud commands for setting up such a network is shown after this table.

Requirement: Egress access for Docker registries
Description:
1. public.ecr.aws
2. quay.io
3. ghcr.io
4. docker.io/truefoundrycloud
5. docker.io/natsio
6. nvcr.io
7. registry.k8s.io
Reason: This is to download the Docker images for TrueFoundry, ArgoCD, NATS, the GPU operator, Argo Rollouts, Argo Workflows, Istio and KEDA.

Requirement: DNS with SSL/TLS
Description: A set of endpoints (preferably wildcard) to point to the deployments being made, something like *.internal.example.com and *.external.example.com. An SSL/TLS certificate with the chosen domains as SANs is required in the same region.
Reason: When developers deploy their services, they will need to access the endpoints of those services to test them or call them from other services. This is why we need the second domain name. It is better to use a wildcard, since developers can then deploy services like service1.internal.example.com and service2.internal.example.com. We also support path-based routing, which would make the endpoints internal.example.com/service1 and internal.example.com/service2.

Requirement: Compute
Description: Quotas must be enabled for the required CPU and GPU instance types (on-demand and preemptible/spot).
Reason: This is to make sure TrueFoundry can bring up the machines as needed.

Requirement: User/ServiceAccount to provision the infrastructure
Description:
- Compute Admin
- Compute Network Admin
- Kubernetes Engine Admin
- Security Admin
- Service Account Admin
- Service Account Token Creator
- Service Account User
- Storage Admin
- Service Usage Admin
Reason: These are the roles required by the IAM user or service account in GCP to create the entire compute-plane infrastructure.
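
If you are creating the network from scratch, the sketch below shows one way to satisfy the VPC requirements with gcloud. The network, subnet, router, NAT and firewall-rule names, as well as the CIDR ranges, are placeholders; adjust them to your environment.

# Custom-mode VPC with a /24 primary range and secondary ranges for pods and services
gcloud compute networks create tfy-vpc --subnet-mode=custom

gcloud compute networks subnets create tfy-subnet \
    --network=tfy-vpc \
    --region=REGION \
    --range=10.10.0.0/24 \
    --secondary-range=pods=10.16.0.0/20,services=10.20.0.0/22

# Cloud Router and Cloud NAT for egress internet access
gcloud compute routers create tfy-router --network=tfy-vpc --region=REGION

gcloud compute routers nats create tfy-nat \
    --router=tfy-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges

# Firewall rules: 80/443 from the internet and all ports within the subnet.
# For private clusters, also allow 443, 8443, 9443 and 15017 from the control plane CIDR.
gcloud compute firewall-rules create tfy-allow-web \
    --network=tfy-vpc \
    --direction=INGRESS \
    --allow=tcp:80,tcp:443 \
    --source-ranges=0.0.0.0/0

gcloud compute firewall-rules create tfy-allow-internal \
    --network=tfy-vpc \
    --direction=INGRESS \
    --allow=all \
    --source-ranges=10.10.0.0/24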

Existing Network

Requirement: VPC
Description:
* Minimum subnet CIDR of /24
* Additional secondary ranges:
  * Pods: at least a /20 CIDR
  * Services: at least a /22 CIDR
* Cloud Router and Cloud NAT
* Ports 80/443 open for ingress from the internet
* All ports open for ingress within the subnet
* Ports 443, 8443, 9443 and 15017 open for connectivity to the GKE master control plane
Reason: This is needed to ensure that around 250 instances and 4096 pods can run in the Kubernetes cluster. If you expect higher scale, increase the subnet ranges. Cloud Router and Cloud NAT are required for egress internet access. Commands to verify an existing network are shown after this table.
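
To check whether an existing network already meets these requirements, you can inspect the subnet's secondary ranges, the NAT configuration and the firewall rules. SUBNET_NAME, ROUTER_NAME, VPC_NAME and REGION below are placeholders.

# Primary CIDR and secondary (pods/services) ranges of the subnet
gcloud compute networks subnets describe SUBNET_NAME \
    --region=REGION \
    --format="yaml(ipCidrRange, secondaryIpRanges)"

# Cloud NAT gateways attached to the Cloud Router
gcloud compute routers nats list --router=ROUTER_NAME --region=REGION

# Firewall rules attached to the VPC
gcloud compute firewall-rules list --filter="network:VPC_NAME"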

Existing Cluster

Requirement: GKE Version
Description: GKE version 1.30 or later (recommended).
Reason: Recommended for the latest security features, better performance and stability.

Requirement: Node Auto Provisioning (NAP)
Description:
- Must be enabled and configured with appropriate resource limits
- Recommended minimum limits:
  * CPU: 1000 cores
  * Memory: 4000 GB
- GPU quotas (if using accelerators)
Reason: Required for automatic node provisioning based on workload demands. This ensures efficient resource allocation and scaling for ML workloads without manual intervention.

Requirement: Workload Identity
Description:
* Must be enabled on the cluster
* Configured with the PROJECT_ID.svc.id.goog workload pool
* Node pools must have Workload Identity enabled
Reason: Required for secure service account access to GCP resources. Eliminates the need for service account keys and provides fine-grained access control for Kubernetes workloads. A command to inspect these settings on an existing cluster is shown after this table.
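
To check the current GKE version, NAP configuration and Workload Identity settings on an existing cluster, you can describe it; CLUSTER_NAME and REGION are placeholders.

gcloud container clusters describe CLUSTER_NAME \
    --location=REGION \
    --format="yaml(currentMasterVersion, autoscaling.enableNodeAutoprovisioning, autoscaling.resourceLimits, workloadIdentityConfig)"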

Node Auto Provisioning Configuration

Node Auto Provisioning (NAP) needs to be properly configured with resource limits and defaults. Here’s how to configure NAP:

To enable NAP on your existing GKE cluster:

# CPU limits are in cores and memory limits are in GB; replace CLUSTER_NAME and REGION with your own values
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 0 \
    --max-cpu 1000 \
    --min-memory 0 \
    --max-memory 10000 \
    --location=REGION

For GPU support, also set accelerator limits (nvidia-tesla-t4 is shown as an example):

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-accelerator type=nvidia-tesla-t4,count=0 \
    --max-accelerator type=nvidia-tesla-t4,count=256 \
    --location=REGION

You can also enable NAP through the Google Cloud Console:

  1. Go to Google Kubernetes Engine > Clusters
  2. Select your cluster and click “Edit”
  3. Under “Features” find “Node Auto-Provisioning” and enable it
  4. Set resource limits for CPU and memory
  5. Click “Save”

For more details, see Google’s official documentation on Node Auto-Provisioning.
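
Alternatively, the same limits can be supplied as a file via the --autoprovisioning-config-file flag. The sketch below mirrors the limits recommended above; the file contents are a sketch based on Google's documented NAP config schema, so verify them against the documentation for your gcloud version.

# Write the NAP limits to a config file (values are illustrative)
cat > nap-config.yaml <<'EOF'
resourceLimits:
  - resourceType: 'cpu'
    minimum: 0
    maximum: 1000
  - resourceType: 'memory'
    maximum: 4000
  - resourceType: 'nvidia-tesla-t4'
    maximum: 256
EOF

gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --autoprovisioning-config-file=nap-config.yaml \
    --location=REGION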

Enable Workload Identity

Run the following commands to enable workload identity:

# Enable Workload Identity feature
gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --region=REGION

# Enable Workload Identity (GKE metadata server) on the node pool
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --workload-metadata=GKE_METADATA \
    --region=REGION
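
Once Workload Identity is enabled, workloads access GCP resources by linking a Kubernetes service account to a GCP service account. The following is a generic sketch with placeholder names (GSA_NAME, KSA_NAME, NAMESPACE, PROJECT_ID), not a TrueFoundry-specific step.

# Allow the Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

# Annotate the Kubernetes service account with the GCP service account
kubectl annotate serviceaccount KSA_NAME \
    --namespace NAMESPACE \
    iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com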

Permissions required to create the infrastructure

The IAM user or service account used to provision the infrastructure should have the following roles (a gcloud sketch for granting them follows the list):

  • Compute Admin
  • Compute Network Admin
  • Kubernetes Engine Admin
  • Security Admin
  • Service Account Admin
  • Service Account Token Creator
  • Service Account User
  • Storage Admin
  • Service Usage Admin
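
As a reference, the roles above can be granted to a provisioning service account with gcloud. The service account email and PROJECT_ID are placeholders, and the role IDs are the standard identifiers that correspond to the role names listed above.

# Grant the provisioning roles to the service account (placeholder names)
SA_EMAIL=provisioner@PROJECT_ID.iam.gserviceaccount.com
for role in \
    roles/compute.admin \
    roles/compute.networkAdmin \
    roles/container.admin \
    roles/iam.securityAdmin \
    roles/iam.serviceAccountAdmin \
    roles/iam.serviceAccountTokenCreator \
    roles/iam.serviceAccountUser \
    roles/storage.admin \
    roles/serviceusage.serviceUsageAdmin; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:${SA_EMAIL}" \
      --role="${role}"
done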