Architecture

Control Plane Components

Component | Description
Dashboard | Manages users, deployments, and other resources.
Backend Service | Handles authentication, authorization, manifest validation, manifest processing, and other business logic.
PostgreSQL | Database that stores user information, deployment information, etc.
Controller | Manages the lifecycle of deployments and performs CRUD operations on the compute plane clusters.
Queue | Stores intermediate data for deployment processing and AI Gateway request logging.
Image Builder | Builds the Docker images for deployments.
AI Gateway | Unifies the request and response format across all LLM providers.
ClickHouse | Database that stores the request logs and metrics of the AI Gateway.

Compute Plane Components

Component | Description
tfy-agent | Service that manages the lifecycle of deployments and performs the requested operations on the compute plane clusters.
Other infra components | Enable additional functionality such as GPU management, monitoring, logging, etc.

External Cloud Components

Component | Description
Blob Storage | Stores source code, models, and ML artifacts. This can be backed by AWS S3, Azure Blob Storage, GCS, etc.
Secret Store | Stores the secrets for deployments. This can be backed by AWS SSM, Azure Key Vault, Google Secret Manager, etc.
Docker Registry | Stores the Docker images for deployments. This can be backed by AWS ECR, Azure Container Registry, GCR, etc.

Compute Requirements

With AI Gateway

Resource Tier | CPU | Memory | Min. Nodes | Disk | Blob Storage | Cost
small | 2 vCPU | 8GB | 2 | 60GB* | 50GB* | ~ $120 pm
medium | 16 vCPU | 56GB | 3 | 450GB* | 200GB* | ~ $800 pm
large | 28 vCPU | 72GB | 3 | 750GB* | 500GB* | ~ $1500 pm

Without AI Gateway

Resource Tier | CPU | Memory | Min. Nodes | Disk | Blob Storage | Cost
small | 1 vCPU | 4GB | 2 | 50GB* | 10GB* | ~ $60 pm
medium | 8 vCPU | 24GB | 3 | 250GB* | 20GB* | ~ $300 pm
large | 16 vCPU | 32GB | 3 | 300GB* | 50GB* | ~ $800 pm

(*) The disk and blob storage are used to store the request logs and metrics. Their required size depends on the size and number of requests, so set it according to your expected usage.
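As a rough illustration of how the footnote can be applied, the sketch below encodes the tier tables above and compares a back-of-the-envelope log volume against a tier's listed disk size. The helper names (`estimate_log_storage_gb`, `extra_disk_gb`), the inputs (requests per day, average bytes stored per request, retention in days), and the simplification that all logs land on disk are assumptions made for this example and are not part of the TrueFoundry tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    vcpu: int
    memory_gb: int
    min_nodes: int
    disk_gb: int
    blob_gb: int
    approx_cost_usd_pm: int


# Values copied from the tables above; keyed by whether AI Gateway is enabled.
TIERS = {
    True: {
        "small": Tier("small", 2, 8, 2, 60, 50, 120),
        "medium": Tier("medium", 16, 56, 3, 450, 200, 800),
        "large": Tier("large", 28, 72, 3, 750, 500, 1500),
    },
    False: {
        "small": Tier("small", 1, 4, 2, 50, 10, 60),
        "medium": Tier("medium", 8, 24, 3, 250, 20, 300),
        "large": Tier("large", 16, 32, 3, 300, 50, 800),
    },
}


def estimate_log_storage_gb(requests_per_day: int,
                            avg_bytes_per_request: int,
                            retention_days: int) -> float:
    """Hypothetical estimate: raw bytes kept over the retention window, in GB.

    Ignores compression and indexing overhead, so treat it as a rough figure.
    """
    total_bytes = requests_per_day * avg_bytes_per_request * retention_days
    return total_bytes / 1e9


def extra_disk_gb(tier: Tier, needed_gb: float) -> float:
    """Disk to provision beyond the tier's listed baseline (0 if it already fits)."""
    return max(0.0, needed_gb - tier.disk_gb)


if __name__ == "__main__":
    needed = estimate_log_storage_gb(1_000_000, 2_000, 30)
    tier = TIERS[True]["small"]
    print(f"Estimated log volume: {needed:.1f} GB")
    print(f"Extra disk beyond '{tier.name}' baseline: {extra_disk_gb(tier, needed):.1f} GB")
```

The estimate is only a starting point: actual storage depends on how the logs are encoded and retained, so measured usage should take precedence over this calculation.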

Scenarios

The provided Terraform code supports the following scenarios. The requirements for each scenario are listed on the page for your cloud provider:

  • New network + New cluster - This is the simplest setup. The TrueFoundry Terraform code creates and configures everything. Make sure your cloud account meets the requirements listed on your cloud provider page.
  • Existing network + New cluster - In this setup, you bring your own VPC and the TrueFoundry Terraform code creates the cluster in that VPC. Make sure you meet the existing-VPC requirements listed on your cloud provider page.
  • Existing cluster - In this setup, the TrueFoundry Terraform code reuses a cluster that you created and sets up all the integrations the platform needs. Make sure you meet the existing-VPC and existing-cluster requirements listed on your cloud provider page.
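The choice between the three scenarios above depends only on what already exists in your cloud account. A minimal sketch of that decision follows; the `Scenario` enum and `choose_scenario` function are hypothetical names used for illustration and do not correspond to variables in the TrueFoundry Terraform code.

```python
from enum import Enum


class Scenario(Enum):
    NEW_NETWORK_NEW_CLUSTER = "New network + New cluster"
    EXISTING_NETWORK_NEW_CLUSTER = "Existing network + New cluster"
    EXISTING_CLUSTER = "Existing cluster"


def choose_scenario(has_network: bool, has_cluster: bool) -> Scenario:
    """Map what you already have to one of the three supported scenarios."""
    if has_cluster:
        # An existing cluster already lives in a network, so the
        # network flag does not change the outcome here.
        return Scenario.EXISTING_CLUSTER
    if has_network:
        return Scenario.EXISTING_NETWORK_NEW_CLUSTER
    return Scenario.NEW_NETWORK_NEW_CLUSTER


if __name__ == "__main__":
    print(choose_scenario(has_network=True, has_cluster=False).value)
```

Whichever scenario applies, the matching requirements on the cloud provider page still need to be checked before running the Terraform code.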

Deploying in Production Mode

To deploy in production mode, first create the required infrastructure components, then install the control plane. The following guides cover the infrastructure requirements and the steps to provision them for each cloud provider:

Provisioning Control Plane Infrastructure on AWS

Provisioning Control Plane Infrastructure on GCP

Provisioning Control Plane Infrastructure on Azure

Once the infrastructure components are set up, install the control plane using the Helm chart: Installing Control Plane using Helm Chart