- Kubernetes Cluster - This is the primary compute on which the applications deployed by the control plane run. This can be EKS, GKE, AKS, OpenShift, Oracle Kubernetes Engine, or any other standard Kubernetes cluster. The control plane should have read access to the cluster configuration. The following addons need to be present on the compute-plane cluster depending on your use case. You can bring your own addons and have full flexibility over their configuration.
ArgoCD (Essential)
TrueFoundry relies on ArgoCD to deploy applications to the compute-plane cluster. The infra applications are deployed in the default ArgoCD project, while user-deployed applications are deployed in the tfy-apps project. You can find the ArgoCD configuration file that TrueFoundry installs by default here. If you are using your own ArgoCD, please ensure the following requirements are met:
- Ensure ArgoCD can create Argo applications in all namespaces; for this, the apps-in-any-namespace feature must be enabled (see the sketch after this list).
- Create a tfy-apps project with the spec from the linked configuration file (also sketched below).
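The authoritative settings and project spec are in the linked configuration file. As an illustrative sketch only (the repo and destination values here are assumptions, not TrueFoundry's exact spec), enabling apps-in-any-namespace and a permissive tfy-apps AppProject look roughly like this:

```yaml
# Enable "apps in any namespace" via the argocd-cmd-params-cm ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  application.namespaces: "*"  # allow Application resources in any namespace
---
# Illustrative tfy-apps AppProject; the authoritative spec is in the linked file.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tfy-apps
  namespace: argocd
spec:
  sourceRepos: ["*"]           # assumption: allow all source repositories
  sourceNamespaces: ["*"]      # accept Applications defined in any namespace
  destinations:
    - server: "*"
      namespace: "*"
  clusterResourceWhitelist:
    - group: "*"
      kind: "*"
```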
Istio (Essential)
Istio is a powerful service mesh and ingress controller. TrueFoundry uses Istio as the primary ingress controller in the compute-plane cluster. We don't inject the sidecar by default; it is only injected where needed for the following use cases:
- Request-count-based autoscaling.
- OAuth-based authentication and authorization for Jupyter Notebooks.
- Intercepts feature to redirect or mirror traffic to other applications.
- Authentication for services deployed on the cluster.
Please ensure that if you have multiple Istio gateways, they do not have the same domains configured. If they do, you will need to specify which gateway to use for the TrueFoundry components as a variable in the tfy-agent helm chart. There are three Istio components that TrueFoundry installs:
- istio-base - The set of CRDs required for Istio to work. You can find the argocd configuration here.
- istio-discovery - The pilot (istiod) service responsible for service discovery in the cluster. You can find the argocd configuration here.
- tfy-istio-ingress - The ingress gateway responsible for routing external traffic to services in the cluster. You can find the argocd configuration here.
Argo Rollouts (Essential)
Argo Rollouts powers the canary and blue-green rollout strategies in TrueFoundry. If you are already using Argo Rollouts in your cluster, TrueFoundry should be able to work with it without any additional configuration. You can find the argocd configuration here.
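For context, a canary rollout of the kind this addon powers is expressed as a Rollout resource. A minimal, hypothetical sketch (not a TrueFoundry-generated spec):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service                     # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example/my-service:v2   # hypothetical image
  strategy:
    canary:
      steps:
        - setWeight: 20                  # send 20% of traffic to the new version
        - pause: {duration: 5m}          # hold before continuing
        - setWeight: 100
```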
Prometheus (Essential)
Prometheus powers the metrics feature on the platform, as well as the autoscaling, autoshutdown, and autopilot features. TrueFoundry uses the open-source kube-prometheus-stack for running Prometheus in the cluster. If you are already using kube-prometheus-stack in your cluster, TrueFoundry should be able to work with it after a couple of configuration changes, sketched below. You can find the argocd configuration here.
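As an illustration of the kind of change typically needed when sharing an existing kube-prometheus-stack, the values below let Prometheus discover ServiceMonitors and PodMonitors from all namespaces instead of only Helm-labelled ones. These overrides are an assumption; the authoritative changes are in the linked configuration:

```yaml
# kube-prometheus-stack Helm values (illustrative)
prometheus:
  prometheusSpec:
    # Pick up ServiceMonitors/PodMonitors regardless of Helm release labels
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```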
Argo Workflows (Optional)
TrueFoundry uses Argo Workflows to power the Jobs feature on the platform. If you are already using Argo Workflows in your cluster, TrueFoundry should be able to work with it with some additional configuration. You can find the argocd configuration here.
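For context, Jobs ultimately execute as Argo Workflow resources. A minimal, hypothetical Workflow (not a TrueFoundry-generated spec) looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-job-    # hypothetical name prefix
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: ["echo", "hello from a job"]
```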
Keda (Optional)
Keda is used to power the autoscaling feature on the platform. TrueFoundry uses the open-source Keda for event-driven autoscaling in the cluster. If you are already using Keda in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
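For context, event-driven autoscaling with Keda is expressed through ScaledObject resources. A minimal, hypothetical example that scales a deployment on a cron schedule:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-scaler    # hypothetical name
spec:
  scaleTargetRef:
    name: my-service         # deployment to scale (hypothetical)
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
    - type: cron
      metadata:
        timezone: UTC
        start: "0 9 * * *"   # scale up at 09:00 UTC
        end: "0 18 * * *"    # scale back down at 18:00 UTC
        desiredReplicas: "3"
```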
Victoria Logs (Optional)
Victoria Logs and Vector are used to power the logs feature on the platform. This is optional; you can choose to provide your own logging solution. If you are already using Victoria Logs in your cluster, TrueFoundry should be able to work without any additional configuration. If you are already using Vector to ingest logs, TrueFoundry should be able to work with some additional configuration. You can find the argocd configuration here.
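As an illustration, Vector can ship logs to Victoria Logs through its Elasticsearch-compatible ingestion endpoint. The sink below is a sketch; the source id, service name, and port are assumptions, and Victoria Logs may need additional query parameters (such as the message field mapping) depending on your setup:

```yaml
# Vector configuration (illustrative)
sinks:
  victorialogs:
    type: elasticsearch
    inputs: ["kubernetes_logs"]    # assumed log source id
    endpoints:
      - http://victoria-logs.logging.svc:9428/insert/elasticsearch/
    api_version: v8
```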
GPU Operator (Optional)
GPU Operator is used to deploy workloads on GPU nodes. It's a TrueFoundry-provided helm chart based on Nvidia's GPU Operator. If you are already using Nvidia's GPU Operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration for the following cloud providers:
Grafana (Optional)
Grafana is a monitoring tool that can be installed to view metrics and logs and to create dashboards for the cluster. TrueFoundry doesn't directly use Grafana to power the monitoring dashboards on the platform, but it is available as a separate addon to view additional cluster-level metrics. If you are already using Grafana in your cluster, you can use it for monitoring. If you want the TrueFoundry-provided Grafana instead, you can install the TrueFoundry Grafana helm chart, which comes with many inbuilt dashboards for cluster monitoring. You can find the argocd configuration here.
[AWS Only] Karpenter (Essential)
Karpenter is required to support dynamic node provisioning on AWS EKS. If you are already using Karpenter in your cluster, TrueFoundry should be able to work with the following additional configuration:
- Install the eks-node-monitoring-agent helm chart.
- Configure Karpenter to use the eks-node-monitoring-agent.
You can find the karpenter argocd configuration here. We also install tfy-karpenter-config, another helm chart that installs the nodepools and nodeclasses. You can find the tfy-karpenter-config argocd configuration here. If you are already using Karpenter, TrueFoundry requires the following nodepool types to be present (a sketch of one such nodepool follows the table):
| Nodepool Type | Configuration | Purpose |
|---|---|---|
| Critical | amd64 Linux on-demand nodepool with taint `class.truefoundry.com/component=critical:NoSchedule` and label `class.truefoundry.com/component=critical` | For running TrueFoundry critical workloads like prometheus, victoria-logs and tfy-agent. |
| GPU nodepool | amd64 Linux on-demand/spot (both) with taint `nvidia.com/gpu=true:NoSchedule` and label `nvidia.com/gpu.deploy.operands=true` | For running user-deployed GPU applications. |
| Default nodepool | amd64 Linux on-demand/spot (both) without any taints | For running user-deployed CPU applications. |
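As an illustration, a Karpenter NodePool satisfying the Critical row might look like the sketch below (Karpenter v1 API; the nodeclass name and the absence of instance-type requirements are assumptions, not TrueFoundry's exact spec):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    metadata:
      labels:
        class.truefoundry.com/component: critical
    spec:
      taints:
        - key: class.truefoundry.com/component
          value: critical
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass name
```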
[AWS Only] Metrics-Server (Essential)
Metrics-Server is required on an AWS EKS cluster for autoscaling. If you are already using Metrics-Server in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
[AWS Only] AWS EBS CSI Driver (Essential)
AWS EBS CSI Driver is required to support EBS volumes on an EKS cluster. If you are already using the AWS EBS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a default storage class to be present in the cluster, preferably gp3 backed by encrypted volumes. You can find the argocd configuration here.
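As an illustration, a default gp3 storage class backed by encrypted volumes can be declared like this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # mark as the default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"   # provision encrypted EBS volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```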
[AWS Only] AWS EFS CSI Driver (Optional)
AWS EFS CSI Driver is required to support EFS volumes on an EKS cluster. If you are already using the AWS EFS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a storage class to be present in the cluster which can be used for mounting EFS volumes. You can find the argocd configuration here.
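A hypothetical EFS storage class using dynamic access-point provisioning (the file system ID is a placeholder to replace with your own):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap             # dynamic provisioning via access points
  fileSystemId: fs-0123456789abcdef0   # placeholder EFS file system ID
  directoryPerms: "700"
```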
[AWS Only] AWS Load Balancer Controller (Essential)
AWS Load Balancer Controller is required to support load balancers on EKS. If you are already using the AWS Load Balancer Controller in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
[AWS Only] TFY Inferentia Operator (Optional)
TFY Inferentia Operator is required to support Inferentia machines on EKS. If you are already using an Inferentia operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
Cert-Manager (Optional)
Cert-Manager is required for provisioning certificates for exposing services. On AWS, you can use AWS Certificate Manager to provision the certificates instead. For more details on how to set up certificates, please refer to the TrueFoundry documentation.
- Docker Registry - This is the registry where the docker images built by the control plane are pushed. This can be ECR, GCR, ACR, Quay, Dockerhub, JFrog, Harbor, or any other standard docker registry. The control plane should have access to push to this registry.
- Blob Storage (Optional) - This will be used to store the ML models and artifacts. This is optional and only needed if you intend to use the model registry feature. The control plane should have access to read and write to this blob storage. Possible integrations are AWS S3, Azure Blob Storage, GCP Storage, MinIO, or any other S3-compatible storage.
- Secret Store (Optional) - This will be used to store the secrets for the applications. This is optional and only needed if you intend to use the secret store feature. The control plane should have access to read and write to this secret store. Possible integrations are AWS Parameter Store, AWS Secrets Manager, Azure Key Vault, and GCP Secret Manager.
- DNS and Certificates - We need to set up a domain name and point it to the load balancer in the compute-plane cluster. This allows us to provide domain names to the workloads deployed on the cluster. The domain can be a single domain or a wildcard domain. For example, if you point a domain like `*.example.com` (wildcard domain) to the load balancer, the services will be exposed as `service1.example.com`, `service2.example.com`, etc. However, if you point a domain like `tfy.example.com` (non-wildcard domain) to the load balancer, the services will be exposed as `tfy.example.com/service1`, `tfy.example.com/service2`, etc. Many frontend applications do not work with path-based routing (the latter case), so we recommend using wildcard domains. We also need to provision certificates to terminate the TLS traffic on the load balancer. For this we can use AWS Certificate Manager, or cert-manager with GCP Cloud DNS or Azure DNS. You can also bring your pre-created certificates.
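If you use cert-manager, a wildcard certificate for a domain like the one above can be requested with a Certificate resource. A minimal sketch, assuming a DNS01-capable ClusterIssuer named letsencrypt-dns already exists (the names and namespace are hypothetical):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: istio-system         # hypothetical; where the gateway terminates TLS
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
    - "*.example.com"
  issuerRef:
    name: letsencrypt-dns         # hypothetical DNS01 ClusterIssuer
    kind: ClusterIssuer
```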