
TrueFoundry Jobs let you run task-oriented workloads that run for a limited duration, complete a task, then terminate and release their resources. Jobs are particularly well suited to the following scenarios:

Model Training: Train machine learning models on large datasets; the resources are freed once training completes.
Maintenance and Cleanup: Schedule routine maintenance tasks such as data backups, model retraining, and report generation.
Batch Inference: Run large-scale batch inference, such as processing large volumes of data with trained models, using a Job's ability to handle parallel workloads efficiently.

TrueFoundry makes it easy to configure various aspects of your job deployment.

Dockerize Code

Deploy from GitHub, your local machine, or a prebuilt image.

Customize Resources

Set CPU, GPU, and memory resources, and choose between spot and on-demand instances.

Environment Variables And Secrets

Set environment variables and secrets for your job.
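As a minimal sketch of how job code might consume such settings, the snippet below reads configuration from the process environment using only the Python standard library. The variable names `DATASET_URI`, `API_TOKEN`, and `BATCH_SIZE` are hypothetical and chosen purely for illustration; they are not names defined by TrueFoundry.

```python
import os


def load_settings(env=None):
    """Read job settings from environment variables.

    DATASET_URI, API_TOKEN and BATCH_SIZE are hypothetical names used
    only for illustration.
    """
    env = os.environ if env is None else env

    # Fail fast with a clear message if a required value is absent.
    required = ("DATASET_URI", "API_TOKEN")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    return {
        "dataset_uri": env["DATASET_URI"],
        "api_token": env["API_TOKEN"],  # secret value: avoid logging it
        "batch_size": int(env.get("BATCH_SIZE", "32")),  # optional, with a default
    }
```

Passing a plain dictionary as `env` makes the function easy to exercise locally without touching the real environment.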

Schedule Job

Schedule your job to run at a specific time.

Trigger Your Job

Trigger your job manually.

Parameterize Job

Parameterize your job so that argument values are easy to change between runs.
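One common way to make a job's entry point accept changing argument values is a small command-line parser. The sketch below uses Python's standard `argparse` module; the parameter names `--epochs` and `--learning-rate` are assumptions made for illustration, not parameters defined by TrueFoundry.

```python
import argparse


def parse_args(argv=None):
    """Parse job parameters; the argument names here are illustrative only."""
    parser = argparse.ArgumentParser(description="Example parameterized job")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of training epochs")
    parser.add_argument("--learning-rate", type=float, default=0.001,
                        help="optimizer learning rate")
    # argv=None makes argparse read sys.argv[1:], which suits a real run.
    return parser.parse_args(argv)


def describe_run(args):
    """Return a short summary of the values this run would use."""
    return f"epochs={args.epochs} learning_rate={args.learning_rate}"
```

For example, `describe_run(parse_args(["--epochs", "3"]))` uses three epochs and keeps the default learning rate, so a different run only needs different argument values, not a code change.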

Retries and Timeout

Set retries and a timeout for your job so that a run that gets stuck or fails is stopped and retried.
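To illustrate what a retry limit and a timeout mean together, here is a small standard-library sketch that runs a command, stops it if it exceeds a time limit, and tries again a bounded number of times. It is an illustration of the semantics only, not how the platform implements them; `run_with_retries` and its parameters are hypothetical names.

```python
import subprocess
import sys


def run_with_retries(command, retries=2, timeout_seconds=60):
    """Run `command`, retrying when it fails or exceeds the timeout.

    Returns the attempt number that succeeded. Raises RuntimeError once
    all attempts (the first try plus `retries`) are used up.
    """
    attempts = retries + 1
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # check=True raises on a non-zero exit code;
            # timeout= kills the process if it runs too long.
            subprocess.run(command, check=True, timeout=timeout_seconds)
            return attempt
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            last_error = exc
    raise RuntimeError(f"command failed after {attempts} attempts") from last_error


if __name__ == "__main__":
    # A trivial command that succeeds on the first attempt.
    print(run_with_retries([sys.executable, "-c", "pass"], retries=1))
```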

Concurrency

Set a concurrency limit for your job to specify how many runs of the job can execute at once.
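The idea of a concurrency limit can be sketched locally with a bounded worker pool: however many tasks are submitted, no more than the limit execute at the same moment. This is a conceptual illustration using Python threads, not the platform's mechanism; `run_with_limit` is a hypothetical helper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(tasks, limit):
    """Run callables with at most `limit` executing at once.

    Returns (results in submission order, peak concurrency observed).
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def tracked(task):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return task()
        finally:
            with lock:
                running -= 1

    # max_workers caps how many tasks execute simultaneously;
    # the remaining tasks wait in the pool's queue.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(tracked, tasks))
    return results, peak
```

With six tasks and `limit=2`, all six complete, but the observed peak never exceeds two.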

Access Cloud Services

Access S3, GCS, Azure Container, and other cloud-managed services.

Mounting Volumes

Mount volumes to cache data.

Deploy Programmatically

Deploy using Python or the CLI.

Set Up CI/CD

Set up deployments with your favorite CI/CD tool.

View Metrics

View the most important metrics for your job run.

View Logs

View the logs for a job run or for an individual pod.

Set Up Alerts

Set up alerts and notifications for your job.

Clone, Update, and Rollback

Clone a job, update its version, roll back to a previous version, and promote to production.
