In this mode, we will be connecting a Kubernetes cluster to the TrueFoundry control plane. The steps vary based on your cloud provider - AWS, GCP or Azure. The key steps are outlined below. The entire setup process takes between 30 minutes and 1 hour if you are using our Terraform code.

1. Connect your Kubernetes Cluster

You can choose to either create a new cluster or onboard an existing one to the TrueFoundry platform. Follow the guides for AWS, GCP or Azure to connect your cluster.

2. Connect your Docker Registry

You need to integrate your Docker registry so that the control plane can push Docker images to it after building your source code. TrueFoundry provides integrations for all the major Docker registries. Follow the guides below to integrate your registry.

  1. AWS ECR
  2. Google Artifact Registry
  3. Azure Container Registry
  4. DockerHub
  5. Quay

If you are using the TrueFoundry generated Terraform code, a Docker registry in your cloud provider will be created by default and attached to the control plane. Access to the registry is granted through a role created for this purpose.

3. Connect your Blob Storage (AWS S3 / GCS Bucket / Azure Container)

You need to integrate with a blob storage bucket to store ML models and artifacts. You can integrate multiple buckets to segregate development and production environments. If you are using the TrueFoundry generated Terraform code, a bucket in the respective cloud provider is created by default, and access to it is granted through a role created for this purpose.

4. Connect Your Secret Store (OPTIONAL)

You can integrate your secret store with TrueFoundry so that developers can save secrets and use them in their applications. If a secret store is integrated, all secrets are stored in your actual secret store and TrueFoundry only stores a link to the secret in its own database. This ensures that your secrets are never stored in the control plane. Follow the guides below to integrate with your secret store:

  1. AWS SSM
  2. Google Secrets Manager
  3. Azure Vault

If you are using the TrueFoundry generated Terraform code, a role with access to SSM is created for you.

5. Connect Git Repository (OPTIONAL)

Integrating with a Git provider allows developers to deploy directly from their Git repository by specifying the repository name, branch and commit SHA. Follow the guides below to integrate with the following Git providers.

  1. GitHub
  2. GitLab
  3. Bitbucket
  4. Azure Repos

Install Application Components

TrueFoundry requires some open-source components to be installed on the cluster to be able to deploy services and models. If you are using the TrueFoundry generated Terraform code, it will automatically set up all the components for you.

You can install the applications, or view the installed applications on each cluster, in TrueFoundry.

The set of mandatory dependencies are:

  1. ArgoCD : TrueFoundry relies on ArgoCD to deploy applications inside a Kubernetes cluster. The infra applications are deployed in the default project in ArgoCD, while user-deployed applications are deployed in the tfy-apps project. If you are using your own ArgoCD, the following spec can be used to create the tfy-apps project.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tfy-apps
  namespace: argocd
spec:
  clusterResourceWhitelist:
  - group: '*'
    kind: '*'
  destinations:
  - namespace: '*'
    server: '*'
  sourceNamespaces:
  - '*'
  sourceRepos:
  - '*'
```
  2. Argo Rollouts : Argo Rollouts powers the rollout process of deployments, enabling blue-green, canary and other rollout strategies.
  3. Argo Workflows : TrueFoundry uses Argo Workflows to run the jobs deployed on the platform.
  4. Istio : Istio is used as the ingress controller and also powers OAuth authentication for Notebooks and traffic-shaping capabilities. TrueFoundry doesn't enable sidecar injection by default; it is only enabled if you use request-count-based autoscaling or want to mirror or intercept traffic.
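As an illustration of how Argo Rollouts drives these strategies, below is a minimal sketch of a canary Rollout. The service name, image and step weights are hypothetical, not something TrueFoundry generates for you.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:v2   # hypothetical image
  strategy:
    canary:
      steps:
      - setWeight: 20        # shift 20% of traffic to the new version
      - pause: {duration: 60s}
      - setWeight: 50
      - pause: {duration: 60s}
```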

The rest of the dependencies are use-case specific and optional, depending on whether you use the corresponding feature.

  1. KEDA for workload autoscaling: TrueFoundry uses KEDA to autoscale your workloads based on time, request count, CPU or queue length.
  2. Prometheus for metrics: Prometheus powers the metrics dashboard in TrueFoundry. Prometheus also provides some of the metrics used for autoscaling.
  3. Loki for logs: The logs feature in TrueFoundry is powered by Loki. This is optional, and you can choose to provide your own logging solution.
  4. GPU operator : This is a TrueFoundry-provided Helm chart for bringing up GPU nodes in different clouds. It is based on NVIDIA's GPU Operator.
  5. Grafana : You can install the TrueFoundry Grafana Helm chart, which comes with many inbuilt dashboards for cluster monitoring.
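For example, KEDA expresses autoscaling policies as ScaledObject resources. Below is a minimal sketch of time-based (cron) scaling; the target name, schedule and replica counts are hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-scaler    # hypothetical name
spec:
  scaleTargetRef:
    name: my-service         # hypothetical deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: cron
    metadata:
      timezone: UTC
      start: "0 8 * * *"     # scale up at 08:00 UTC
      end: "0 20 * * *"      # scale back down at 20:00 UTC
      desiredReplicas: "3"
```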

AWS Specific Components:

  1. Metrics Server : This is required on AWS EKS clusters for autoscaling.
  2. AWS EBS CSI Driver : This is required for supporting EBS volumes on EKS clusters.
  3. AWS EFS CSI Driver : This is required for supporting EFS volumes on EKS clusters.
  4. TFY Inferentia operator : This is required for supporting AWS Inferentia machines on EKS.
  5. AWS Load Balancer Controller : This is required for provisioning load balancers on EKS.
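As a sketch of what the EBS CSI driver enables, a typical gp3 StorageClass that persistent volumes can reference might look like the following. The class name and parameters are illustrative, not a TrueFoundry default.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3                        # hypothetical class name
provisioner: ebs.csi.aws.com       # handled by the EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```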

Custom Components:

  1. Cert-Manager : This is needed to provision certificates for exposing services. On AWS, you can use AWS Certificate Manager to provision the certificates instead. For more details on how to set up certificates, please refer to the TrueFoundry documentation.
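As a sketch, a cert-manager ClusterIssuer for Let's Encrypt might look like the following. The contact email and the HTTP-01 solver ingress class are assumptions for illustration, not TrueFoundry defaults.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com       # hypothetical contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # secret holding the ACME account key
    solvers:
    - http01:
        ingress:
          class: istio                     # assumes Istio is the ingress controller
```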