Deploy Control Plane Only
Architecture
Control Plane Components
Component | Description |
---|---|
Dashboard | Dashboard to manage users, deployments, and other resources. |
Backend Service | Backend service to handle authentication, authorization, manifest validation, manifest processing and other business logic. |
PostgreSQL | Database to store user information, deployment information, etc. |
Controller | Controller that manages the lifecycle of the deployment and performs CRUD operations on the compute plane clusters. |
Queue | Queue to store intermediate data for deployment processing and AI Gateway request logging. |
Image Builder | Image builder to build the Docker images for the deployment. |
AI Gateway | AI Gateway to unify the request and response format for all LLM providers. |
ClickHouse | Database to store the request logs and metrics of the AI Gateway. |
Compute Plane Components
Component | Description |
---|---|
tfy-agent | Service that manages the lifecycle of the deployment and performs the requested operations on the compute plane clusters. |
Other infra components | Other infra components to enable different functionalities like GPU management, monitoring, logging, etc. |
External Cloud Components
Component | Description |
---|---|
Blob Storage | Storage for source code, models, and ML artifacts. This can be backed by AWS S3, Azure Blob Storage, GCS, etc. |
Secret Store | Secret store to store the secrets for the deployment. This can be backed by AWS SSM, Azure Key Vault, Google Secret Manager, etc. |
Docker registry | Docker registry to store the Docker images for the deployment. This can be backed by AWS ECR, Azure Container Registry, GCR, etc. |
Compute Requirements
With AI Gateway
Resource Tier | CPU | Memory | Min. Nodes | Disk | Blob Storage | Cost |
---|---|---|---|---|---|---|
small | 2 vCPU | 8GB | 2 | 60GB* | 50GB* | ~$120/month |
medium | 16 vCPU | 56GB | 3 | 450GB* | 200GB* | ~$800/month |
large | 28 vCPU | 72GB | 3 | 750GB* | 500GB* | ~$1500/month |
Without AI Gateway
Resource Tier | CPU | Memory | Min. Nodes | Disk | Blob Storage | Cost |
---|---|---|---|---|---|---|
small | 1 vCPU | 4GB | 2 | 50GB* | 10GB* | ~$60/month |
medium | 8 vCPU | 24GB | 3 | 250GB* | 20GB* | ~$300/month |
large | 16 vCPU | 32GB | 3 | 300GB* | 50GB* | ~$800/month |
(*) The disk and blob storage are used to store the request logs and metrics. The required size depends on the size and number of requests, so set it according to your expected usage.
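Because the starred values scale with request volume, a rough back-of-the-envelope estimate can help when choosing a starting size. The sketch below is illustrative only: the function name, the workload numbers (requests per day, average logged size, retention period), and the simple multiplication model are all assumptions, not figures or formulas from TrueFoundry. It ignores compression, metrics, and indexing overhead.

```python
def estimate_log_storage_gb(requests_per_day: int,
                            avg_log_bytes: int,
                            retention_days: int) -> float:
    """Rough storage needed for request logs, in GB (1 GB = 10**9 bytes).

    Hypothetical model: every request produces one log record of
    avg_log_bytes, kept for retention_days. Compression and metrics
    overhead are not modelled.
    """
    total_bytes = requests_per_day * avg_log_bytes * retention_days
    return total_bytes / 10**9


# Assumed workload: 1M requests/day, 4 KB per logged request, 30-day retention.
estimate = estimate_log_storage_gb(1_000_000, 4_000, 30)
print(f"{estimate:.0f} GB")  # prints "120 GB"
```

Under these assumed numbers the estimate is 120 GB, which is already above the 60GB disk listed for the small tier with AI Gateway, so the starred values should be treated as starting points rather than limits.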
Scenarios
The provided Terraform code supports the following scenarios. The requirements for each scenario are listed on the page for your cloud provider:
- New network + New cluster - This is the simplest setup. The TrueFoundry Terraform code takes care of creating and configuring everything. Make sure your cloud account meets the requirements listed on your cloud provider page.
- Existing network + New cluster - In this setup, you bring your own VPC and the TrueFoundry Terraform code creates the cluster in that VPC. Make sure to meet the existing-VPC requirements listed on your cloud provider page.
- Existing cluster - In this setup, the TrueFoundry Terraform code reuses the cluster you created and sets up all the integrations the platform needs. Make sure to meet the existing-VPC and existing-cluster requirements listed on your cloud provider page.
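The choice between the three scenarios comes down to what already exists in your cloud account. As a minimal sketch of that decision (the `Scenario` enum and `pick_scenario` function are hypothetical names used for illustration, not part of the TrueFoundry Terraform code):

```python
from enum import Enum


class Scenario(Enum):
    NEW_NETWORK_NEW_CLUSTER = "New network + New cluster"
    EXISTING_NETWORK_NEW_CLUSTER = "Existing network + New cluster"
    EXISTING_CLUSTER = "Existing cluster"


def pick_scenario(has_network: bool, has_cluster: bool) -> Scenario:
    """Map what you already have to one of the three supported scenarios."""
    if has_cluster:
        # An existing cluster already runs inside a network, so both are reused.
        return Scenario.EXISTING_CLUSTER
    if has_network:
        # Only the VPC exists; the cluster is created inside it.
        return Scenario.EXISTING_NETWORK_NEW_CLUSTER
    # Nothing exists yet; everything is created from scratch.
    return Scenario.NEW_NETWORK_NEW_CLUSTER


print(pick_scenario(has_network=True, has_cluster=False).value)
# prints "Existing network + New cluster"
```

The further down the list you go, the more requirements your existing resources must already satisfy, so check the matching section of your cloud provider page before picking a scenario.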
Deploying in Production Mode
To deploy in production mode, first create the required infrastructure components, then install the control plane. The guides below cover the infrastructure requirements for each cloud provider and the steps to create them:
Provisioning Control Plane Infrastructure on AWS
Provisioning Control Plane Infrastructure on GCP
Provisioning Control Plane Infrastructure on Azure
Once the infrastructure components are set up, install the control plane using the Helm chart: Installing Control Plane using Helm Chart