Deploy Control Plane Only
The Truefoundry control-plane ships as a helm chart that can be installed on any Kubernetes cluster. The control-plane has the following architecture:
Architecture
Control Plane Components
Component | Used For | Description |
---|---|---|
Dashboard | Essential | This is the UI component to view deployments and other resources. |
Backend Service | Essential | Truefoundry comprises multiple backend services that handle various aspects such as authorization, deployment control flow, CRUD APIs, and interaction with the database and external services. |
PostgreSQL | Essential | Database to store user information, deployment information, etc. This can be deployed on Kubernetes in dev mode; however, we recommend using a managed database like RDS in production. |
Controller | Essential | The controller is responsible for handling all connections from the multiple tfy-agent components in different compute-plane clusters. |
Queue | Essential | We use NATS as a queueing and caching layer to process requests and logs from the AI Gateway. |
Image Builder | Deployment Only | We build Docker images on the Kubernetes cluster using BuildKit for our deployments. |
AI Gateway | AI Gateway Only | AI Gateway that unifies the request and response format across all LLM providers. |
Clickhouse | AI Gateway Only | Database to store request logs and metrics of AI Gateway. |
External Cloud Components
Component | Description |
---|---|
Blob Storage | Control-plane needs access to one blob storage to store the code uploaded for building the docker image. This can be backed by AWS S3, Azure Blob Storage, GCS, etc. |
Secret Store | Secret store to store the secrets for the deployment. This can be backed by AWS SSM, Azure Key Vault, GCP Secret Manager, etc. |
Docker registry | Docker registry to store the docker images for the deployment. This can be backed by AWS ECR, Azure Container Registry, GCR, etc. |
Compute Requirements
To install the control-plane, we need a Kubernetes cluster and a managed Postgres database. Truefoundry ships as a helm chart (https://github.com/truefoundry/infra-charts/tree/main/charts/truefoundry) with configurable options to deploy both the Deployment and AI Gateway features, or just one of them, according to your needs. The compute requirements change based on the set of features enabled and the scale of users and requests.
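Before sizing your cluster, you can inspect the chart's configurable options locally. A minimal sketch follows; the helm repo URL is an assumption based on the infra-charts GitHub repository, so check the chart's README for the authoritative source:

```shell
# Add the TrueFoundry helm repo (URL assumed from the infra-charts repo)
# and refresh the local index.
helm repo add truefoundry https://truefoundry.github.io/infra-charts
helm repo update

# Dump the chart's default values to a file to review the available
# feature toggles (Deployment, AI Gateway) and resource settings.
helm show values truefoundry/truefoundry > values.yaml
```

Edit the resulting `values.yaml` to enable only the features you need before installing.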
Here are a few scenarios that you can choose from based on your needs.
The small tier is recommended for development purposes. Here all the components are deployed on Kubernetes in non-HA mode (single replica). This is suitable if you are just testing out the different features of Truefoundry.
This setup brings up 1 replica of each service and is not highly available. It lets you test the features, but we do not recommend it for production.
Component | CPU | Memory | Storage | Min Nodes | Remarks |
---|---|---|---|---|---|
Helm-Chart (AI Deployment + AI Gateway) | 2 vCPU | 8GB | 60GB Persistent Volumes (Block Storage) On Kubernetes | 2 Pods should be spread over min 2 nodes | Cost: ~ $120 pm |
Helm-Chart (AI Deployment Only) | 1 vCPU | 4GB | 50GB Persistent Volumes (Block Storage) On Kubernetes | 2 Pods should be spread over min 2 nodes | Cost: ~ $60 pm |
Helm-Chart (AI Gateway Only) | 2 vCPU | 8GB | 60GB Persistent Volumes (Block Storage) On Kubernetes | 2 Pods should be spread over min 2 nodes | Cost: ~ $120 pm |
Postgres (Deployed on Kubernetes) | 0.5 vCPU | 0.5GB | 5GB Persistent Volumes (Block Storage) On Kubernetes | - | PostgreSQL version >= 13 |
Blob Storage (S3 Compatible) | - | - | 20GB | - | - |
The medium tier is configured for production and will suffice for teams of 10-500 members.
This configuration can handle up to 10,000 deployments daily and around 50 Kubernetes compute-plane clusters. The AI Gateway is configured with a minimum of 3 replicas, which can handle around 500 requests/second to LLMs.
It is configured to scale horizontally and autoscale when the load increases. The Block Storage and S3 are used to store LLM request logs; their size depends on the size and number of requests and should be set as per the expected usage.
Component | CPU | Memory | Storage | Min Nodes | Remarks |
---|---|---|---|---|---|
Helm-Chart (AI Deployment + AI Gateway) | 16 vCPU | 56GB | 450GB Persistent Volumes (Block Storage) On Kubernetes | 3 Pods should be spread over min 3 nodes | Cost: ~ $1000 pm |
Helm-Chart (AI Deployment Only) | 8 vCPU | 24GB | 250GB Persistent Volumes (Block Storage) On Kubernetes | 3 Pods should be spread over min 3 nodes | Cost: ~ $400 pm |
Helm-Chart (AI Gateway Only) | 16 vCPU | 48GB | 250GB Persistent Volumes (Block Storage) On Kubernetes | 3 Pods should be spread over min 3 nodes | Cost: ~ $800 pm |
Postgres (Managed Database) | 2 vCPU | 4GB | 30GB | - | PostgreSQL version >= 13 |
Blob Storage (S3 Compatible) | - | - | 500GB | - | - |
The large tier is configured for production and will suffice for organizations of 500-50,000 members.
This configuration can handle up to 1 million deployments daily and over 500 Kubernetes compute-plane clusters. The AI Gateway is configured with a minimum of 10 replicas, which can handle around 2000 requests/second to LLMs.
It is configured to scale horizontally and autoscale when the load increases. The Block Storage and S3 are used to store LLM request logs; their size depends on the size and number of requests and should be set as per the expected usage.
Component | CPU | Memory | Storage | Min Nodes | Remarks |
---|---|---|---|---|---|
Helm-Chart (AI Deployment + AI Gateway) | 32 vCPU | 72GB | 750GB Persistent Volumes (Block Storage) On Kubernetes | 10 Pods should be spread over min 5 nodes | Cost: ~ $1600 pm |
Helm-Chart (AI Deployment Only) | 16 vCPU | 32GB | 300GB Persistent Volumes (Block Storage) On Kubernetes | 5 Pods should be spread over min 5 nodes | Cost: ~ $1000 pm |
Helm-Chart (AI Gateway Only) | 32 vCPU | 64GB | 400GB Persistent Volumes (Block Storage) On Kubernetes | 10 Pods should be spread over min 10 nodes | Cost: ~ $1400 pm |
Postgres (Managed Database) | 8 vCPU | 16GB | 100GB | - | PostgreSQL version >= 13 |
Blob Storage (S3 Compatible) | - | - | 1000GB | - | - |
Deploying Control-Plane in your own environment
The following scenarios are supported by the provided Terraform code. You can find the requirements for each scenario in each cloud provider section:
- New network + New cluster - This is the simplest setup. The TrueFoundry Terraform code takes care of spinning up and setting up everything. Make sure your cloud account is ready with the requirements listed on your cloud provider page.
- Existing network + New cluster - In this setup, you bring your own VPC and the TrueFoundry Terraform code takes care of creating the cluster in that VPC. Make sure to adhere to the existing-VPC requirements mentioned on your cloud provider page.
- Existing cluster - In this setup, the TrueFoundry Terraform code reuses the cluster created by you to set up all the integrations needed for the platform to work. Make sure to adhere to the existing-VPC and existing-cluster requirements mentioned on your cloud provider page.
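Whichever scenario you choose, the Terraform workflow itself is the same. A minimal sketch, assuming the TrueFoundry Terraform code is checked out locally and a variables file has been prepared per your cloud provider page (the file name below is illustrative):

```shell
# From the directory containing the TrueFoundry Terraform code:
# initialize providers and modules.
terraform init

# Review the resources that will be created for your chosen scenario
# (new network + new cluster, existing network, or existing cluster).
terraform plan -var-file="truefoundry.tfvars"

# Provision the infrastructure.
terraform apply -var-file="truefoundry.tfvars"
```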
Deploying in Production Mode
To deploy in production mode, we will first create the appropriate infrastructure components before moving on to the actual installation. The guides for individual cloud providers, covering the infrastructure requirements and the steps to create them, are available here:
Provisioning Control Plane Infrastructure on AWS
Provisioning Control Plane Infrastructure on GCP
Provisioning Control Plane Infrastructure on Azure
Once the infra components are set up, we can go ahead and install the control plane using the helm chart - Installing Control Plane using Helm Chart
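As a rough sketch, the install step looks like the following; the release name, namespace, and repo alias are illustrative assumptions, and the linked guide is the authoritative reference:

```shell
# Install (or upgrade in place) the control plane into its own namespace,
# using the values file prepared for your tier and feature set.
# Release name "truefoundry" and namespace "truefoundry" are assumptions.
helm upgrade --install truefoundry truefoundry/truefoundry \
  --namespace truefoundry --create-namespace \
  -f values.yaml
```

`helm upgrade --install` is idempotent, so the same command works for both the first install and later upgrades.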