Truefoundry Docs

The TrueFoundry control plane is packaged as a Helm chart that can be installed on any Kubernetes cluster. It has the following architecture:

Architecture

Control Plane Components

Component	Used For	Description
Dashboard	Essential	The UI for viewing deployments and other resources.
Backend Service	Essential	TrueFoundry comprises multiple backend services that handle authorization, deployment control flow, CRUD APIs, and interaction with the database and external services.
PostgreSQL	Essential	Database that stores user information, deployment information, etc. It can be deployed on Kubernetes in dev mode; however, we recommend a managed database such as RDS in production.
Controller	Essential	The controller handles all connections from the tfy-agent components running in the different compute-plane clusters.
Queue	Essential	We use NATS as a queueing and caching layer to process requests and logs from the LLM Gateway.
Image Builder	Deployment Only	Builds Docker images for deployments on the Kubernetes cluster using BuildKit.
AI Gateway	AI Gateway Only	Gateway that unifies the request and response format across all LLM providers.
Clickhouse	AI Gateway Only	Database that stores the request logs and metrics of the AI Gateway.
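
The "Used For" column above determines which components run for a given feature set. As a rough illustration, the following Python sketch transcribes that column and filters it. The function name `required_components` and its parameters are hypothetical and are not part of the Helm chart or any TrueFoundry API.

```python
from typing import List

# Transcribed from the "Used For" column of the components table above.
COMPONENTS = {
    "Dashboard": "Essential",
    "Backend Service": "Essential",
    "PostgreSQL": "Essential",
    "Controller": "Essential",
    "Queue": "Essential",
    "Image Builder": "Deployment Only",
    "AI Gateway": "AI Gateway Only",
    "Clickhouse": "AI Gateway Only",
}


def required_components(deployment: bool, ai_gateway: bool) -> List[str]:
    """Return the components needed for the chosen feature set (hypothetical helper)."""
    if not (deployment or ai_gateway):
        raise ValueError("enable at least one of deployment or ai_gateway")
    wanted = {"Essential"}
    if deployment:
        wanted.add("Deployment Only")
    if ai_gateway:
        wanted.add("AI Gateway Only")
    # Dict insertion order is preserved, so the result follows the table order.
    return [name for name, used_for in COMPONENTS.items() if used_for in wanted]


if __name__ == "__main__":
    print(required_components(deployment=False, ai_gateway=True))
```

For an AI Gateway only installation, the result contains the five essential components plus AI Gateway and Clickhouse, and omits the Image Builder.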

External Cloud Components

Component	Description
Blob Storage	The control plane needs access to one blob store to hold the code uploaded for building Docker images. This can be backed by AWS S3, Azure Blob Storage, GCS, etc.
Secret Store	Stores the secrets used by deployments. This can be backed by AWS SSM, Azure Key Vault, Google Secret Manager, etc.
Docker Registry	Stores the Docker images used by deployments. This can be backed by AWS ECR, Azure Container Registry, GCR, etc.

Compute Requirements

To install the control plane, you need a Kubernetes cluster and a managed Postgres database. TrueFoundry ships as a Helm chart (https://github.com/truefoundry/infra-charts/tree/main/charts/truefoundry) with configurable options to deploy both the Deployment and AI Gateway features, or only one of them. The compute requirements depend on the features enabled and on the number of users and requests. The scenarios below cover common needs.

The small tier is recommended for development purposes. All components are deployed on Kubernetes in non-HA mode (a single replica of each service), so the setup is not highly available. It is suitable for testing the different features of TrueFoundry, but we do not recommend it for production.

Component	CPU	Memory	Storage	Min Nodes	Remarks
Helm Chart (AI Deployment + AI Gateway)	2 vCPU	8 GB	60 GB (persistent volumes, block storage, on Kubernetes)	2 (pods should be spread over at least 2 nodes)	Cost: ~$120 per month
Helm Chart (AI Deployment Only)	1 vCPU	4 GB	50 GB (persistent volumes, block storage, on Kubernetes)	2 (pods should be spread over at least 2 nodes)	Cost: ~$60 per month
Helm Chart (AI Gateway Only)	2 vCPU	8 GB	60 GB (persistent volumes, block storage, on Kubernetes)	2 (pods should be spread over at least 2 nodes)	Cost: ~$120 per month
Postgres (Deployed on Kubernetes)	0.5 vCPU	0.5 GB	5 GB (persistent volumes, block storage, on Kubernetes)		PostgreSQL version >= 13
Blob Storage (S3 Compatible)			20 GB		
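
To size a small-tier cluster, the rows above can be added up. The sketch below does this arithmetic in Python under one assumption that the table does not state explicitly: the Postgres row is counted in addition to the Helm chart row. The names `Resources`, `HELM_CHART`, and `small_tier_total` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float         # vCPU
    memory_gb: float   # GB of memory
    storage_gb: float  # GB of persistent volume (block storage)

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.cpu + other.cpu,
            self.memory_gb + other.memory_gb,
            self.storage_gb + other.storage_gb,
        )


# Figures copied from the small-tier table above.
HELM_CHART = {
    "deployment+gateway": Resources(2, 8, 60),
    "deployment": Resources(1, 4, 50),
    "gateway": Resources(2, 8, 60),
}
POSTGRES_ON_K8S = Resources(0.5, 0.5, 5)
BLOB_STORAGE_GB = 20  # object storage, provisioned outside the cluster


def small_tier_total(feature_set: str) -> Resources:
    """Cluster resources for one feature set, assuming the rows are additive."""
    return HELM_CHART[feature_set] + POSTGRES_ON_K8S


if __name__ == "__main__":
    for name in HELM_CHART:
        print(name, small_tier_total(name))
```

Under that assumption, the combined Deployment and AI Gateway setup needs about 2.5 vCPU, 8.5 GB of memory, and 65 GB of persistent volumes, plus 20 GB of blob storage.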

Deploying Control-Plane in your own environment

The following scenarios are supported by the provided Terraform code. The requirements for each scenario are listed in the section for each cloud provider:

New network + New cluster - This is the simplest setup. The TrueFoundry Terraform code creates and configures everything. Make sure your cloud account meets the requirements listed on your cloud provider page.
Existing network + New cluster - You bring your own VPC, and the TrueFoundry Terraform code creates the cluster in that VPC. Make sure to meet the existing-VPC requirements listed on your cloud provider page.
Existing cluster - The TrueFoundry Terraform code reuses a cluster you have already created and sets up all the integrations the platform needs. Make sure to meet the existing-VPC and existing-cluster requirements listed on your cloud provider page.
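
The choice between the three scenarios comes down to what already exists in your cloud account. A minimal sketch of that decision in Python follows. The function `pick_scenario` is hypothetical and is not part of the TrueFoundry Terraform code, and it assumes that an existing cluster always lives in an existing network.

```python
def pick_scenario(existing_network: bool, existing_cluster: bool) -> str:
    """Map what you already have to one of the three supported scenarios."""
    if existing_cluster:
        # Assumption: an existing cluster implies an existing network (VPC).
        return "Existing cluster"
    if existing_network:
        return "Existing network + New cluster"
    return "New network + New cluster"


if __name__ == "__main__":
    print(pick_scenario(existing_network=True, existing_cluster=False))
```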

Deploying in Production Mode

To deploy in production mode, first create the required infrastructure components, then install the control plane. The guides covering the infrastructure requirements and the steps to create them for each cloud provider are:

Provisioning Control Plane Infrastructure on AWS
Provisioning Control Plane Infrastructure on GCP
Provisioning Control Plane Infrastructure on Azure

Once the infrastructure components are set up, install the control plane using the Helm chart: Installing Control Plane using Helm Chart
