The compute plane consists of one or more Kubernetes clusters on which user-deployed applications run. This can be an AWS EKS cluster, GKE cluster, AKS cluster, OpenShift cluster, Oracle Kubernetes Engine cluster, or any other standard on-prem Kubernetes cluster.
The compute plane always lives in the customer’s own cloud environment; TrueFoundry does not provide Kubernetes clusters as compute on its own. This ensures all data and compute stay within the customer’s own infrastructure. TrueFoundry can help create a new compute plane cluster using Terraform (recommended) or can use an existing cluster. If using an existing cluster, please make sure it conforms to the key requirements mentioned below.
The tfy-agent runs on the compute plane and is responsible for connecting to the control plane. It connects over a secure WebSocket connection, receives instructions from the control plane, and sends real-time updates about the Kubernetes resources back to it. The compute plane cluster hosts the infrastructure-related Kubernetes applications like ArgoCD, the GPU operator, etc., as well as the user-deployed applications.
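As a rough illustration, connecting a cluster usually comes down to pointing the tfy-agent Helm release at the control plane with a cluster token. The value names below are hypothetical, not the chart's actual schema; refer to the tfy-agent Helm chart for the real values.
# Hypothetical tfy-agent values -- key names are illustrative only
config:
  controlPlaneURL: https://truefoundry.example.com  # placeholder control-plane URL
  clusterToken: <cluster-token>                     # token issued when the cluster is registered on the control plane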
The key infrastructure addons on the Kubernetes cluster are as follows:
TrueFoundry relies on ArgoCD to deploy applications to the compute-plane cluster. The infra applications are deployed in the default project in ArgoCD, while the user-deployed applications are deployed in the tfy-apps project. If you are using your own ArgoCD, please make sure of the following requirements:
  1. Ensure ArgoCD can create Argo applications in all namespaces. For this, the following values must be set (shown in Helm-values form after the project spec below):
server.extraArgs[0]="--insecure"
server.extraArgs[1]="--application-namespaces=*"
controller.extraArgs[0]="--application-namespaces=*"
  2. Create a tfy-apps project with the following spec.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tfy-apps
  namespace: argocd
spec:
  clusterResourceWhitelist:
  - group: '*'
    kind: '*'
  destinations:
  - namespace: '*'
    server: '*'
  sourceNamespaces:
  - '*'
  sourceRepos:
  - '*'
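For reference, if ArgoCD is managed with the community argo-cd Helm chart, the server and controller flags from step 1 map into the values file roughly as below. This is a sketch of the Helm-values form only; adapt it to however you manage your ArgoCD installation.
server:
  extraArgs:
    # Serve the ArgoCD API/UI without TLS (terminate TLS elsewhere)
    - --insecure
    # Allow Applications to be created in any namespace
    - --application-namespaces=*
controller:
  extraArgs:
    - --application-namespaces=*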
You can find the ArgoCD configuration file that Truefoundry installs by default here.
Prometheus is used to power the metrics feature on the platform. It also powers the autoscaling, autoshutdown and autopilot features of the platform. TrueFoundry uses the open-source kube-prometheus-stack for running Prometheus in the cluster. If you are already using kube-prometheus-stack in your cluster, TrueFoundry should be able to work with it with the following configuration changes:
  kube-state-metrics:
    metricLabelsAllowlist:
    - pods=[truefoundry.com/application,truefoundry.com/component-type,truefoundry.com/component,truefoundry.com/application-id]
    - nodes=[karpenter.sh/capacity-type,eks.amazonaws.com/capacityType,kubernetes.azure.com/scalesetpriority,kubernetes.azure.com/accelerator,cloud.google.com/gke-provisioning,node.kubernetes.io/instance-type]
and
  alertmanager:
    alertmanagerSpec:
      alertmanagerConfigMatcherStrategy:
        type: None
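Taken together, the two snippets above live in a single kube-prometheus-stack values file. A sketch with the intent of each setting noted:
kube-state-metrics:
  # Surface TrueFoundry pod labels and cloud capacity-type node labels as
  # kube-state-metrics labels so the platform can slice metrics by them
  metricLabelsAllowlist:
    - pods=[truefoundry.com/application,truefoundry.com/component-type,truefoundry.com/component,truefoundry.com/application-id]
    - nodes=[karpenter.sh/capacity-type,eks.amazonaws.com/capacityType,kubernetes.azure.com/scalesetpriority,kubernetes.azure.com/accelerator,cloud.google.com/gke-provisioning,node.kubernetes.io/instance-type]
alertmanager:
  alertmanagerSpec:
    # Do not auto-inject namespace matchers into AlertmanagerConfig routes
    alertmanagerConfigMatcherStrategy:
      type: None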
You can find the argocd configuration here.
Istio is a powerful service mesh and ingress controller. TrueFoundry uses Istio as the primary ingress controller in the compute-plane cluster. If you are using any other ingress controller, most of the features in the platform will still work, except the ones listed below that specifically rely on Istio's Envoy proxy or Envoy filters.
We don’t inject the Istio sidecar by default; it is only injected where needed for the use cases mentioned below.
The key features that rely on Istio and will not work otherwise are:
  1. Request-count-based autoscaling (see the KEDA sketch after this list).
  2. OAuth-based authentication and authorization for Jupyter Notebooks. Without Istio, there is no authentication and authorization for the notebooks.
  3. Intercepts feature to redirect / mirror traffic to other applications.
  4. Authentication for services deployed on the cluster.
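For instance, request-count-based autoscaling needs Istio's request metrics (such as istio_requests_total) to be available in Prometheus. A hypothetical sketch of such a scaler, using a KEDA ScaledObject with a Prometheus trigger, is shown below; the service name, namespace, and Prometheus address are placeholders, and this is not necessarily the exact object TrueFoundry generates.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-rps        # hypothetical
  namespace: my-namespace     # hypothetical
spec:
  scaleTargetRef:
    name: my-service          # Deployment to scale (hypothetical)
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.prometheus.svc.cluster.local:9090  # assumed Prometheus endpoint
        query: sum(rate(istio_requests_total{destination_service_name="my-service"}[2m]))
        threshold: "100"      # target requests per second per replica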
If you are already using Istio in your cluster, TrueFoundry should be able to work with it without any additional configuration. The TrueFoundry agent automatically discovers the Istio gateways and exposes their domains to the control plane.
Please ensure that if you have multiple Istio gateways, they do not have the same domains configured. If they do, you will need to specify which gateway the TrueFoundry components should use as a variable in the tfy-agent Helm chart.
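For context, a gateway's "domains" are the hosts listed under its servers. A minimal example (names and hosts are illustrative):
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: tfy-wildcard           # illustrative
  namespace: istio-system
spec:
  selector:
    istio: tfy-istio-ingress   # must match your ingress gateway's pod labels
  servers:
    - hosts:
        - "*.example.com"      # the domains this gateway exposes
      port:
        name: http
        number: 80
        protocol: HTTP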
There are three istio components that TrueFoundry installs:
  1. istio-base - This installs the CRDs required for Istio to work. You can find the argocd configuration here.
  2. istio-discovery - This is the pilot (istiod) service responsible for service discovery in the cluster. You can find the argocd configuration here.
  3. tfy-istio-ingress - This is the ingress gateway responsible for routing external traffic to services in the cluster. You can find the argocd configuration here.
Argo Rollouts is used to power the canary and blue-green rollout strategies in TrueFoundry. If you are already using Argo Rollouts in your cluster, TrueFoundry should be able to work with it without any additional configuration. You can find the argocd configuration here.
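To illustrate what Argo Rollouts powers, here is a minimal canary Rollout; names and image are placeholders, not a TrueFoundry-generated spec:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service             # hypothetical
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: nginx:1.27    # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 25          # shift 25% of traffic to the new version
        - pause: {duration: 60}  # wait 60 seconds before continuing
        - setWeight: 100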
TrueFoundry uses Argo Workflows to power the Jobs feature on the platform. If you are already using Argo Workflows in your cluster, TrueFoundry should be able to work with the following configuration:
  controller:
    workflowDefaults:
      spec:
        # Maximum runtime for a workflow (5 days)
        activeDeadlineSeconds: 432000
        ttlStrategy:
          # Delete completed workflows after one hour
          secondsAfterCompletion: 3600
    # Allow a large number of workflows to run concurrently
    namespaceParallelism: 1000
    parallelism: 1000
You can find the argocd configuration here.
KEDA is used to power the autoscaling feature on the platform. TrueFoundry uses open-source KEDA for event-driven autoscaling in the cluster. If you are already using KEDA in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
VictoriaLogs and Vector are used to power the logs feature on the platform. This is optional, and you can choose to provide your own logging solution.
Without tfy-logs, we will not be able to show the aggregated logs on the platform for the services.
If you are already using VictoriaLogs in your cluster, TrueFoundry should be able to work without any additional configuration. If you are already using Vector to ingest logs, TrueFoundry should be able to work with the following configuration:
  customConfig:
    sinks:
      vlogs:
        type: elasticsearch
        query:
          _time_field: timestamp
          _stream_fields: namespace,pod,container,stream,truefoundry_com_application,truefoundry_com_deployment_version,truefoundry_com_component_type,truefoundry_com_retry_number,job_name,sparkoperator_k8s_io_app_name,truefoundry_com_buildName
        inputs:
          - parser
    transforms:
      parser:
        type: remap
        inputs:
          - k8s
        source: >
          if .message == "" {
            .message = " "
          }

          # Extract basic pod information

          .service = .kubernetes.container_name

          .container = .kubernetes.container_name

          .app = .kubernetes.container_name

          .pod = .kubernetes.pod_name

          .node = .kubernetes.pod_node_name

          .namespace = .kubernetes.pod_namespace

          .job_name = .kubernetes.job_name

          # Extract ALL pod labels dynamically using for_each

          pod_labels = object(.kubernetes.pod_labels) ?? {}

          # Iterate through all pod labels and add them with a prefix

          for_each(pod_labels) -> |key, value| {
            label_key = replace(replace(replace(key, ".", "_"), "/", "_"), "-", "_")
            . = set!(., [label_key], string(value) ?? "")
          }

          # Clean up kubernetes metadata

          del(.kubernetes)

          del(.file)
    sources:
      k8s:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 1000
You can find the argocd configuration here.
GPU Operator is used to deploy workloads on the GPU nodes. It’s a TrueFoundry-provided Helm chart based on Nvidia’s GPU operator. If you are already using Nvidia’s GPU Operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration for the following cloud providers (a minimal GPU workload sketch follows this list):
  1. AWS
  2. GCP
  3. Azure
  4. Generic.
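For illustration, a workload lands on a GPU node by requesting the nvidia.com/gpu resource that the operator's device plugin advertises and by tolerating the GPU nodepool taint. A minimal sketch, with name and image as placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                             # illustrative
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu                          # tolerate the GPU nodepool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                        # request one GPU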
Grafana is a monitoring tool that can be installed to view metrics and logs and to create dashboards for the cluster. TrueFoundry doesn’t directly use Grafana to power the monitoring dashboard on the platform, but it is available as a separate addon to view additional cluster-level metrics. If you are already using Grafana in your cluster, you can use it for monitoring the cluster. If you want to use the TrueFoundry-provided Grafana, you can install the TrueFoundry Grafana Helm chart, which comes with many inbuilt dashboards for cluster monitoring. You can find the argocd configuration here.
Karpenter is required for supporting dynamic node provisioning on AWS EKS. If you are already using Karpenter in your cluster, TrueFoundry should be able to work with the following additional configuration:
  1. Install eks-node-monitoring-agent helm chart.
  2. Configure Karpenter to use the eks-node-monitoring-agent.
settings:
  featureGates:
    nodeRepair: true
You can find the karpenter argocd configuration here. We also install tfy-karpenter-config, another Helm chart that installs the nodepools and nodeclasses. If you are already using Karpenter in your cluster, TrueFoundry requires the following nodepool types to be present:
Nodepool Type | Configuration | Purpose
Critical | amd64 linux on-demand nodepool with taint class.truefoundry.com/component=critical:NoSchedule and label class.truefoundry.com/component=critical | For running TrueFoundry critical workloads like prometheus, victoria-logs and tfy-agent.
GPU nodepool | amd64 linux on-demand/spot (both) with taint nvidia.com/gpu=true:NoSchedule and label nvidia.com/gpu.deploy.operands=true | For running user-deployed GPU applications.
Default nodepool | amd64 linux on-demand/spot (both) without any taints | For running user-deployed CPU applications.
You can find the tfy-karpenter-config argocd configuration here.
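As an example, the Critical nodepool from the table above could look roughly like the following Karpenter v1 NodePool. This is a sketch assuming an EC2NodeClass named default; the actual tfy-karpenter-config manifests may differ.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical                  # illustrative name
spec:
  template:
    metadata:
      labels:
        class.truefoundry.com/component: critical
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # assumes an EC2NodeClass named "default"
      taints:
        - key: class.truefoundry.com/component
          value: critical
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]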
Metrics-Server is required on AWS EKS clusters for autoscaling. If you are already using Metrics-Server in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
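For context, Metrics-Server serves the CPU and memory metrics that a standard HorizontalPodAutoscaler consumes; a minimal example, with placeholder names:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out when average CPU crosses 70%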
AWS EBS CSI Driver is required for supporting EBS volumes on EKS clusters. If you are already using the AWS EBS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a default storage class to be present in the cluster, preferably gp3 backed by encrypted volumes. You can find the argocd configuration here.
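A default storage class meeting that expectation could look like the following; this is a sketch, and an equivalent one may already be created for you:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # mark as the cluster default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"              # provision encrypted volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true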
AWS EFS CSI Driver is required for supporting EFS volumes on EKS clusters. If you are already using the AWS EFS CSI Driver in your cluster, TrueFoundry should be able to work without any additional configuration. We do expect a storage class to be present in the cluster which can be used for mounting EFS volumes. You can find the argocd configuration here.
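Such a storage class could be defined along these lines; the filesystem ID is a placeholder for your own EFS filesystem:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                          # illustrative
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0    # placeholder; set to your EFS filesystem ID
  directoryPerms: "700"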
AWS Load Balancer Controller is required for supporting load balancers on EKS. If you are already using the AWS Load Balancer Controller in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
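For illustration, the controller provisions an NLB when a Service carries its annotations; a minimal sketch with placeholder names and ports:
apiVersion: v1
kind: Service
metadata:
  name: my-service               # hypothetical
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: my-service
  ports:
    - port: 80
      targetPort: 8080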
TFY Inferentia Operator is required for supporting Inferentia machines on EKS. If you are already using the Inferentia Operator in your cluster, TrueFoundry should be able to work without any additional configuration. You can find the argocd configuration here.
Cert-Manager is required for provisioning certificates for exposing services. On AWS, you can use AWS Certificate Manager to provision the certificates. For more details on how to set up the certificates, please refer to the TrueFoundry documentation.
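If you use cert-manager with Let's Encrypt instead of AWS Certificate Manager, a ClusterIssuer along these lines is one option. This is a sketch assuming a Route53-hosted zone and that cert-manager's service account has Route53 permissions; the email and region are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt               # illustrative
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com   # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1     # placeholder AWS region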