- Kubernetes Cluster - This is the primary compute on which the applications deployed by the control plane run. This can be EKS, GKE, AKS, OpenShift, Oracle Kubernetes Engine, or any other standard Kubernetes cluster. The control plane should have read access to the cluster configuration. The following addons need to be present on the compute-plane cluster depending on your use case. You can bring your own addons and have full flexibility over their configuration.
ArgoCD (Essential)
TrueFoundry relies on ArgoCD to deploy applications to the compute-plane cluster. The infra applications are deployed in the default ArgoCD project, while user-deployed applications are deployed in the tfy-apps project. You can find the ArgoCD configuration file that TrueFoundry installs by default here. If you are using your own ArgoCD, please ensure the following requirements are met:
- Ensure ArgoCD can create Argo applications in all namespaces; for this, the apps-in-any-namespace feature must be enabled (see the sketch after this list).
- Create a tfy-apps project with the spec from the linked configuration file (also sketched below).
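The authoritative settings and project spec are in the linked configuration file. As an illustrative sketch only (the repo and destination values here are assumptions, not TrueFoundry's exact spec), enabling apps-in-any-namespace and a permissive tfy-apps AppProject look roughly like this:

```yaml
# Enable "apps in any namespace" via the argocd-cmd-params-cm ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  application.namespaces: "*"  # allow Application resources in any namespace
---
# Illustrative tfy-apps AppProject; the authoritative spec is in the linked file.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tfy-apps
  namespace: argocd
spec:
  sourceRepos: ["*"]           # assumption: allow all source repositories
  sourceNamespaces: ["*"]      # accept Applications defined in any namespace
  destinations:
    - server: "*"
      namespace: "*"
  clusterResourceWhitelist:
    - group: "*"
      kind: "*"
```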
Istio (Essential)
Istio is a powerful service mesh and ingress controller. TrueFoundry uses Istio as the primary ingress controller in the compute-plane cluster. We don't inject the sidecar by default; it is only injected where needed for the following use cases:
- Request-count-based autoscaling.
- OAuth-based authentication and authorization for Jupyter Notebooks.
- Intercepts feature to redirect or mirror traffic to other applications.
- Authentication for services deployed on the cluster.
Please ensure that if you have multiple Istio gateways, they do not have the same domains configured. If they do, you will need to specify which gateway to use for the TrueFoundry components as a variable in the tfy-agent helm chart. There are three Istio components that TrueFoundry installs:
- istio-base - The set of CRDs required for Istio to work. You can find the argocd configuration here.
- istio-discovery - The pilot (istiod) service responsible for service discovery in the cluster. You can find the argocd configuration here.
- tfy-istio-ingress - The ingress gateway responsible for routing external traffic to services in the cluster. You can find the argocd configuration here.
Argo Rollouts (Essential)
Argo Rollouts powers the canary and blue-green rollout strategies in TrueFoundry. If you are already using Argo Rollouts in your cluster, TrueFoundry should be able to work with it without any additional configuration. You can find the argocd configuration here.
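For context, a canary rollout of the kind this addon powers is expressed as a Rollout resource. A minimal, hypothetical sketch (not a TrueFoundry-generated spec):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service                     # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example/my-service:v2   # hypothetical image
  strategy:
    canary:
      steps:
        - setWeight: 20                  # send 20% of traffic to the new version
        - pause: {duration: 5m}          # hold before continuing
        - setWeight: 100
```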
Prometheus (Essential)
Prometheus powers the metrics feature on the platform, as well as the autoscaling, autoshutdown, and autopilot features. TrueFoundry uses the open-source kube-prometheus-stack for running Prometheus in the cluster. If you are already using kube-prometheus-stack in your cluster, TrueFoundry should be able to work with it after a couple of configuration changes, sketched below. You can find the argocd configuration here.
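As an illustration of the kind of change typically needed when sharing an existing kube-prometheus-stack, the values below let Prometheus discover ServiceMonitors and PodMonitors from all namespaces instead of only Helm-labelled ones. These overrides are an assumption; the authoritative changes are in the linked configuration:

```yaml
# kube-prometheus-stack Helm values (illustrative)
prometheus:
  prometheusSpec:
    # Pick up ServiceMonitors/PodMonitors regardless of Helm release labels
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```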
Argo Workflows (Optional)
TrueFoundry uses Argo Workflows to power the Jobs feature on the platform. If you are already using Argo Workflows in your cluster, TrueFoundry should be able to work with it with some additional configuration. You can find the argocd configuration here.
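For context, Jobs ultimately execute as Argo Workflow resources. A minimal, hypothetical Workflow (not a TrueFoundry-generated spec) looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-job-    # hypothetical name prefix
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: ["echo", "hello from a job"]
```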
Keda (Optional)
Keda is used to power the autoscaling feature on the platform. TrueFoundry uses the open-source Keda for event-driven autoscaling in the cluster. If you are already using Keda in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
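For context, event-driven autoscaling with Keda is expressed through ScaledObject resources. A minimal, hypothetical example that scales a deployment on a cron schedule:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-scaler    # hypothetical name
spec:
  scaleTargetRef:
    name: my-service         # deployment to scale (hypothetical)
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
    - type: cron
      metadata:
        timezone: UTC
        start: "0 9 * * *"   # scale up at 09:00 UTC
        end: "0 18 * * *"    # scale back down at 18:00 UTC
        desiredReplicas: "3"
```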
Victoria Logs (Optional)
Victoria Logs and Vector are used to power the logs feature on the platform. This is optional; you can choose to provide your own logging solution. If you are already using Victoria Logs in your cluster, TrueFoundry should be able to work without any additional configuration. If you are already using Vector to ingest logs, TrueFoundry should be able to work with some additional configuration. You can find the argocd configuration here.
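As an illustration, Vector can ship logs to Victoria Logs through its Elasticsearch-compatible ingestion endpoint. The sink below is a sketch; the source id, service name, and port are assumptions, and Victoria Logs may need additional query parameters (such as the message field mapping) depending on your setup:

```yaml
# Vector configuration (illustrative)
sinks:
  victorialogs:
    type: elasticsearch
    inputs: ["kubernetes_logs"]    # assumed log source id
    endpoints:
      - http://victoria-logs.logging.svc:9428/insert/elasticsearch/
    api_version: v8
```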
GPU Operator (Optional)
GPU Operator is used to deploy workloads on GPU nodes. It's a TrueFoundry-provided helm chart based on Nvidia's GPU Operator. If you are already using Nvidia's GPU Operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration for the following cloud providers:
Grafana (Optional)
Grafana is a monitoring tool that can be installed to view metrics and logs and to create dashboards for the cluster. TrueFoundry doesn't directly use Grafana to power the monitoring dashboards on the platform, but it is available as a separate addon to view additional cluster-level metrics. If you are already using Grafana in your cluster, you can use it for monitoring. If you want the TrueFoundry-provided Grafana instead, you can install the TrueFoundry Grafana helm chart, which comes with many inbuilt dashboards for cluster monitoring. You can find the argocd configuration here.
[AWS Only] Karpenter (Essential)
Karpenter is required to support dynamic node provisioning on AWS EKS. If you are already using Karpenter in your cluster, TrueFoundry should be able to work with the following additional configuration:
- Install the eks-node-monitoring-agent helm chart.
- Configure Karpenter to use the eks-node-monitoring-agent.
You can find the karpenter argocd configuration here. We also install tfy-karpenter-config, another helm chart that installs the nodepools and nodeclasses. You can find the tfy-karpenter-config argocd configuration here. If you are already using Karpenter, TrueFoundry requires the following nodepool types to be present (a sketch of one such nodepool follows the table):
| Nodepool Type | Configuration | Purpose |
|---|---|---|
| Critical | amd64 Linux on-demand nodepool with taint `class.truefoundry.com/component=critical:NoSchedule` and label `class.truefoundry.com/component=critical` | For running TrueFoundry critical workloads like prometheus, victoria-logs and tfy-agent. |
| GPU nodepool | amd64 Linux on-demand/spot (both) with taint `nvidia.com/gpu=true:NoSchedule` and label `nvidia.com/gpu.deploy.operands=true` | For running user-deployed GPU applications. |
| Default nodepool | amd64 Linux on-demand/spot (both) without any taints | For running user-deployed CPU applications. |
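As an illustration, a Karpenter NodePool satisfying the Critical row might look like the sketch below (Karpenter v1 API; the nodeclass name and the absence of instance-type requirements are assumptions, not TrueFoundry's exact spec):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    metadata:
      labels:
        class.truefoundry.com/component: critical
    spec:
      taints:
        - key: class.truefoundry.com/component
          value: critical
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass name
```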
[AWS Only] Metrics-Server (Essential)
Metrics-Server is required on an AWS EKS cluster for autoscaling. If you are already using Metrics-Server in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
[AWS Only] AWS EBS CSI Driver (Essential)
AWS EBS CSI Driver is required to support EBS volumes on an EKS cluster. If you are already using the AWS EBS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a default storage class to be present in the cluster, preferably gp3 backed by encrypted volumes. You can find the argocd configuration here.
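As an illustration, a default gp3 storage class backed by encrypted volumes can be declared like this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # mark as the default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"   # provision encrypted EBS volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```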
[AWS Only] AWS EFS CSI Driver (Optional)
AWS EFS CSI Driver is required to support EFS volumes on an EKS cluster. If you are already using the AWS EFS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a storage class to be present in the cluster which can be used for mounting EFS volumes. You can find the argocd configuration here.
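A hypothetical EFS storage class using dynamic access-point provisioning (the file system ID is a placeholder to replace with your own):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap             # dynamic provisioning via access points
  fileSystemId: fs-0123456789abcdef0   # placeholder EFS file system ID
  directoryPerms: "700"
```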
[AWS Only] AWS Load Balancer Controller (Essential)
AWS Load Balancer Controller is required to support load balancers on EKS. If you are already using the AWS Load Balancer Controller in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
[AWS Only] TFY Inferentia Operator (Optional)
TFY Inferentia Operator is required to support Inferentia machines on EKS. If you are already using an Inferentia operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
Cert-Manager (Optional)
Cert-Manager is required for provisioning certificates for exposing services. On AWS, you can use AWS Certificate Manager to provision the certificates instead. For more details on how to set up certificates, please refer to the TrueFoundry documentation.
- Docker Registry - This is the registry where the docker images built by the control plane are pushed. This can be ECR, GCR, ACR, Quay, Dockerhub, JFrog, Harbor, or any other standard docker registry. The control plane should have access to push to this registry.
- Blob Storage (Optional) - This will be used to store the ML models and artifacts. This is optional and only needed if you intend to use the model registry feature. The control plane should have access to read and write to this blob storage. Possible integrations are AWS S3, Azure Blob Storage, GCP Storage, MinIO, or any other S3-compatible storage.
- Secret Store (Optional) - This will be used to store the secrets for the applications. This is optional and only needed if you intend to use the secret store feature. The control plane should have access to read and write to this secret store. Possible integrations are AWS Parameter Store, AWS Secrets Manager, Azure Key Vault, and GCP Secret Manager.
- DNS and Certificates - We need to set up a domain name and point it to the load balancer in the compute-plane cluster. This allows us to provide domain names to the workloads deployed on the cluster. The domain can be a single domain or a wildcard domain. For example, if you point a domain like `*.example.com` (wildcard domain) to the load balancer, the services will be exposed as `service1.example.com`, `service2.example.com`, etc. However, if you point a domain like `tfy.example.com` (non-wildcard domain) to the load balancer, the services will be exposed as `tfy.example.com/service1`, `tfy.example.com/service2`, etc. Many frontend applications do not work with path-based routing (the latter case), so we recommend using wildcard domains. We also need to provision certificates to terminate the TLS traffic on the load balancer. For this we can use AWS Certificate Manager, or cert-manager with GCP Cloud DNS or Azure DNS. You can also bring your pre-created certificates.
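If you use cert-manager, a wildcard certificate for a domain like the one above can be requested with a Certificate resource. A minimal sketch, assuming a DNS01-capable ClusterIssuer named letsencrypt-dns already exists (the names and namespace are hypothetical):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: istio-system         # hypothetical; where the gateway terminates TLS
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
    - "*.example.com"
  issuerRef:
    name: letsencrypt-dns         # hypothetical DNS01 ClusterIssuer
    kind: ClusterIssuer
```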