Deploy Compute Plane

Connect a Kubernetes cluster in your cloud account to the Truefoundry control plane

In this mode, we connect a Kubernetes cluster to the Truefoundry control plane. The steps vary slightly based on your cloud provider: AWS, GCP or Azure. The key steps are outlined below. The entire setup process takes between 30 minutes and 1 hour if you use our onboarding CLI script.

1. Connect your Kubernetes Cluster

Create a new Kubernetes Cluster using our onboarding script (HIGHLY RECOMMENDED)

We highly recommend creating a new cluster using our onboarding script, which sets up the correct permissions and configuration and performs all the subsequent steps automatically. The onboarding script (ocli) asks you for the configuration of the cluster and executes Terraform code to create it. It uses Terragrunt underneath to keep the Terraform configuration DRY. You can view the complete Terraform code while executing the ocli script.

You can check the onboarding CLI document here for installation instructions and a brief overview.

Onboard an existing Kubernetes cluster

While it's possible to onboard an existing cluster, we need to make sure that the cluster has the recommended settings and components installed for everything to work correctly. You can follow the guides for AWS, GCP or Azure to connect your existing cluster.

2. Connect your Docker Registry

You need to integrate your docker registry so that the control plane can push docker images to your registry after building the source code. TrueFoundry provides integrations for all the major docker registries. Follow the guides below to integrate your docker registry.

  1. AWS ECR
  2. Google Artifact Registry
  3. Azure Container Registry
  4. DockerHub
  5. Quay

If you are using the Onboarding CLI, a docker registry in the respective cloud provider is created by default. You just need to check the output of the onboarding CLI to connect the docker registry for AWS, GCP and Azure.
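For orientation, the registry hostnames of the three cloud providers follow well-known patterns. The Python sketch below builds them; the function name and the account, project and registry values are placeholders chosen for illustration, not output of the onboarding CLI.

```python
def registry_url(provider: str, **kw: str) -> str:
    """Return the registry host (plus repository prefix where applicable)."""
    if provider == "aws":
        # AWS ECR: <account-id>.dkr.ecr.<region>.amazonaws.com
        return f"{kw['account_id']}.dkr.ecr.{kw['region']}.amazonaws.com"
    if provider == "gcp":
        # Google Artifact Registry: <location>-docker.pkg.dev/<project>/<repository>
        return f"{kw['location']}-docker.pkg.dev/{kw['project']}/{kw['repository']}"
    if provider == "azure":
        # Azure Container Registry: <registry-name>.azurecr.io
        return f"{kw['registry_name']}.azurecr.io"
    raise ValueError(f"unsupported provider: {provider}")


# Placeholder values, for illustration only.
print(registry_url("aws", account_id="123456789012", region="us-east-1"))
```

Comparing the value reported by the onboarding CLI against the expected pattern is a quick way to spot a registry copied from the wrong account or region.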

3. Connect your Blob Storage (AWS S3 / GCS Bucket / Azure Container)

We need to integrate with at least one blob storage bucket to store the ML models and artifacts. You can integrate multiple blob storage buckets to segregate development and production environments. If you are using the onboarding CLI, a bucket in the respective cloud provider is created by default, along with a role that can access it.

4. Connect Your Secret Store (OPTIONAL)

You can integrate your secret store with Truefoundry so that developers can save secrets and use them in their applications. If a secret store is integrated, all secrets are stored in your own secret store and Truefoundry only stores a link to each secret in its own databases. This ensures that your secrets are not stored in the control plane. Follow the guides below to integrate with the following secret stores:

  1. AWS SSM
  2. Google Secrets Manager
  3. Azure Vault

If you are using the onboarding CLI, a role with access to SSM is created for you.
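The idea that only a link to each secret is kept on the Truefoundry side can be sketched as follows. This is an illustrative model, not Truefoundry's implementation: `SecretStore`, `ControlPlaneDB` and the `store://` reference format are hypothetical, and an in-memory dict stands in for your real secret store.

```python
REF_PREFIX = "store://"  # hypothetical reference format, for illustration only


class SecretStore:
    """Stand-in for your own secret store (in-memory, hypothetical)."""

    def __init__(self):
        self._values = {}

    def put(self, name, value):
        self._values[name] = value
        # Hand back a reference (link) to the secret, never the value itself.
        return REF_PREFIX + name

    def get(self, ref):
        return self._values[ref[len(REF_PREFIX):]]


class ControlPlaneDB:
    """Stand-in for the control plane database: it holds references only."""

    def __init__(self):
        self.refs = {}

    def save_secret(self, store, name, value):
        self.refs[name] = store.put(name, value)


store = SecretStore()
db = ControlPlaneDB()
db.save_secret(store, "db-password", "s3cr3t")

print(db.refs)  # contains the reference, not the secret value
```

The value can only be resolved by going back to the secret store with the reference, which is why access to the store is granted through a role rather than by copying secrets.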

5. Connect Git Repository (OPTIONAL)

Integrating with the Git repository allows developers to deploy directly from their Git repository by specifying the repository name, branch and commit SHA. Follow the guides below to integrate with the following Git repositories.
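As an illustration of the repository name, branch and commit SHA triple mentioned above, a deployment source can be modelled as three fields. The sketch is hypothetical (`GitSource` and its field names are not Truefoundry's API) and assumes a repository using 40-character SHA-1 commit IDs.

```python
import re
from dataclasses import dataclass

# Full-length SHA-1 commit ID: 40 lowercase hex characters (assumption).
_SHA1_RE = re.compile(r"[0-9a-f]{40}")


@dataclass(frozen=True)
class GitSource:
    repo: str        # placeholder, e.g. "my-org/my-service"
    branch: str
    commit_sha: str

    def __post_init__(self):
        # Reject abbreviated or malformed SHAs so a deployment is pinned
        # to exactly one commit.
        if not _SHA1_RE.fullmatch(self.commit_sha):
            raise ValueError("commit_sha must be a 40-character lowercase hex SHA-1")


source = GitSource(repo="my-org/my-service", branch="main", commit_sha="a" * 40)
```

Pinning the commit SHA in addition to the branch makes a deployment reproducible even after the branch moves on.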

Install Application Components

Truefoundry requires some open-source components to be installed on the cluster to be able to deploy services and models. If you are using the OCLI script to bootstrap or create the cluster, it will automatically set up all the components for you.

You can install the applications or view the installed applications on each cluster in Truefoundry.

The mandatory dependencies are:

  1. ArgoCD, Argo Rollouts : TrueFoundry relies on the ArgoCD Application object to deploy applications inside a Kubernetes cluster. The infra applications are deployed in the default project in ArgoCD, while user-deployed applications are deployed in the tfy-apps project. Argo Rollouts powers the rollout process of deployments, enabling blue-green, canary and other rollout strategies.
  2. Argo Workflows : Truefoundry uses Argo Workflows to run the jobs deployed on the platform.
  3. Istio : Istio is used as the Ingress controller and also powers OAuth authentication for Notebooks and traffic shaping. Truefoundry doesn't impose sidecar injection by default - it's only done if we use request-count-based autoscaling or need to mirror or intercept traffic.

The rest of the dependencies are optional and depend on whether you use the corresponding feature.

  1. Keda for workload Autoscaling: Truefoundry uses Keda to autoscale your workloads based on time, request count, CPU or queue length.
  2. Prometheus for Metrics: Prometheus powers the metrics dashboard in Truefoundry. Prometheus also helps provide some of the metrics for autoscaling.
  3. Loki for Logs: The logs feature in Truefoundry is powered via Loki. This is optional and you can choose to provide your own logging solution.
  4. GPU operator : This is a Truefoundry-provided helm chart for bringing up GPU nodes in different clouds. It's based on NVIDIA's GPU operator.
  5. Grafana : You can install the Truefoundry grafana helm chart that comes with a lot of inbuilt dashboards for cluster monitoring.

AWS Specific Components:

  1. Metrics-Server: This is required on AWS EKS clusters for metrics collection.
  2. AWS EBS CSI Driver : This is required for supporting EBS volumes on EKS clusters.
  3. AWS EFS CSI Driver : This is required for supporting EFS volumes on EKS clusters.
  4. TFY Inferentia operator : This is required for supporting Inferentia machines on EKS.

Azure Specific Components:

  1. Cert-Manager : This is needed for provisioning certificates on Azure AKS clusters.
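
The component lists above can be summarised as a small lookup, which may be handy when reviewing an existing cluster by hand. This is only a sketch: the function names and the lowercase component identifiers are illustrative labels, not the Helm chart or application names used by Truefoundry, and the empty GCP entry simply reflects that no GCP-specific components are listed here.

```python
from typing import Set

# Labels below are illustrative identifiers derived from the lists above.
MANDATORY = {"argocd", "argo-rollouts", "argo-workflows", "istio"}
OPTIONAL = {"keda", "prometheus", "loki", "gpu-operator", "grafana"}
PROVIDER_SPECIFIC = {
    "aws": {
        "metrics-server",
        "aws-ebs-csi-driver",
        "aws-efs-csi-driver",
        "tfy-inferentia-operator",
    },
    "azure": {"cert-manager"},
    "gcp": set(),  # none listed in this document
}


def missing_mandatory(installed: Set[str]) -> Set[str]:
    """Return the mandatory components absent from the installed set."""
    return MANDATORY - installed


def provider_components(provider: str) -> Set[str]:
    """Return the provider-specific components listed for a cloud provider."""
    return PROVIDER_SPECIFIC[provider]


# Example: a cluster that has only ArgoCD and Istio installed.
print(sorted(missing_mandatory({"argocd", "istio"})))
```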