Truefoundry Docs

Volumes provide persistent storage to containers so that multiple pods can read and write data on a central disk. They are especially useful in machine learning, where you need to store and access data, models, and other artifacts required for training, serving, and inference. A few use cases where volumes are particularly useful:

Share training data: Multiple data scientists may train on the same data, or multiple experiments may run in parallel on the same dataset. The naive approach is to duplicate the data for each data scientist, but this costs much more. A more efficient approach is to store the training data in a volume and mount that volume to each data scientist’s notebook.
Model Storage: When models are hosted as realtime APIs, multiple replicas of the API server handle the traffic. Each replica needs to download the model from the model registry (say, S3) to local disk. If every replica does this on every startup, startup takes longer and S3 access costs go up. With volumes, you can store your trained models externally and mount them onto every replica of the inference server.
Checkpointing: When training machine learning models, it’s common to save checkpoints periodically so that training can resume after a failure, or so that the model can be fine-tuned later. Volumes can store these checkpoint files, so training progress is not lost when a job restarts after a failure. This also lets you run training on spot instances, which can reduce cost substantially.
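The checkpointing use case can be sketched in a few lines of Python. This is a minimal illustration, not a Truefoundry API: the `CHECKPOINT_DIR` environment variable, the file name, and the `train` loop are all hypothetical, and the directory falls back to a temporary folder so the sketch runs anywhere. In a real job it would point at the path where the volume is mounted.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical location: in a real job this would be the volume's mount path.
CHECKPOINT_DIR = Path(os.environ.get("CHECKPOINT_DIR", tempfile.mkdtemp()))
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_FILE = CHECKPOINT_DIR / "checkpoint.json"


def save_checkpoint(state: dict) -> None:
    """Write to a temporary file first, then rename it over the old checkpoint."""
    tmp = CHECKPOINT_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    # Renaming avoids leaving a half-written checkpoint if the job dies mid-write.
    os.replace(tmp, CHECKPOINT_FILE)


def load_checkpoint() -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if CHECKPOINT_FILE.exists():
        return json.loads(CHECKPOINT_FILE.read_text())
    return {"epoch": 0}


def train(total_epochs: int) -> dict:
    """Resume from the last checkpoint and save progress after every epoch."""
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of training would go here ...
        state = {"epoch": epoch + 1}
        save_checkpoint(state)
    return state
```

If the job is restarted after a failure or a spot interruption, calling `train` again picks up from the last completed epoch, because the checkpoint file lives on the volume and outlives the pod.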

When to use Volume vs Blob storage like S3 / GCS / Azure Container?

You can store data in volumes or in blob storage like S3/GCS/Azure container. The key differences between the two are:

Feature	Volume	Blob Storage
Access in Code	Mount volume and use standard file read/write APIs	Use a client SDK such as boto3, google-cloud-storage, or azure-storage-blob to read/write data
Performance	Faster read/write speeds	Slower read/write speeds
Durability	High	Extremely high (11 9’s)
Storage Cost	Higher cost	Lower cost
Access Cost	Performance-based (throughput & IOPS)	Request-based (GET, PUT, LIST, etc.)
Access Constraints	Limited to same region and cluster	Global access
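To make the "Access in Code" row concrete, here is a hedged sketch of the two access styles. The function names, mount path, bucket, and key are placeholders chosen for illustration. The S3 variant uses boto3's `get_object` call and imports the SDK lazily, so the volume variant works even where boto3 is not installed.

```python
from pathlib import Path


def read_from_volume(mount_path: str, relative_path: str) -> bytes:
    """Volume: the data is an ordinary file under the mount point."""
    return (Path(mount_path) / relative_path).read_bytes()


def read_from_s3(bucket: str, key: str) -> bytes:
    """Blob storage: each read is an API request made through a client SDK."""
    import boto3  # imported here so the volume path has no SDK dependency

    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

The volume version needs no credentials or SDK in the application code, which is why existing code that expects local files usually works unchanged once the volume is mounted.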

Choose a volume like EFS if:

You need a shared file system accessible by multiple instances.
Your application requires file system semantics (e.g., file locking, renaming).
You need low-latency access to files, especially for frequently accessed data.

Avoid writing to the same path on a volume from multiple pods at the same time, since this can cause data corruption.
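One simple way to keep pods from writing to the same path is to give each pod its own subdirectory on the volume. The sketch below assumes the pod name is available through the `HOSTNAME` environment variable (Kubernetes sets a container's hostname to the pod name by default); the function name and directory layout are hypothetical.

```python
import os
import socket
from pathlib import Path


def pod_output_path(mount_path: str, filename: str) -> Path:
    """Return a file path inside a directory that is unique to this pod."""
    # Assumption: HOSTNAME carries the pod name; fall back to the system hostname.
    pod_name = os.environ.get("HOSTNAME") or socket.gethostname()
    pod_dir = Path(mount_path) / pod_name
    pod_dir.mkdir(parents=True, exist_ok=True)
    return pod_dir / filename
```

Each pod then writes only under its own directory, while every pod can still read the files written by the others.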

Choose blob storage like S3 if:

You need to store large amounts of unstructured data.
You don’t need file system semantics.
You need to archive data for long-term storage.
Cost is a primary concern, especially for infrequently accessed data.
