Introduction
TrueFoundry Jobs let you run task-oriented workloads: each one runs for a limited duration to complete a task, then terminates and releases its resources.
Here are some scenarios where Jobs are particularly well-suited:
- Model Training: Train machine learning models on large datasets, with resources freed as soon as training is complete.
- Maintenance and Cleanup: Schedule routine maintenance tasks, such as data backups, model retraining, and report generation.
- Batch Inference: Perform large-scale batch inference tasks, such as processing large volumes of data with trained models, taking advantage of a Job's ability to handle parallel workloads efficiently.
TrueFoundry makes it easy to configure various aspects of your job deployment.
Dockerize Code
Deploy from GitHub, your local machine, or a prebuilt image.
Customize Resources
Set CPU, GPU, and memory resources, and choose between spot and on-demand instances.
Environment Variables And Secrets
Set environment variables and secrets for your job.
Schedule Job
Schedule a job to run at a specific time.
Parameterize Job
Parameterize your job so that argument values are easy to change between runs.
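As a rough illustration of what a parameterized job can look like on the code side, the sketch below reads argument values from the command line with Python's `argparse`. The `--epochs` and `--learning-rate` arguments and their defaults are hypothetical, and the sketch does not show how TrueFoundry itself passes parameter values to the job's command.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical arguments for a training job; the names and defaults are
    # illustrative only and are not part of any TrueFoundry API.
    parser = argparse.ArgumentParser(description="Example parameterized job")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # A real job would do its work here; this sketch only reports the values.
    summary = f"epochs={args.epochs} learning_rate={args.learning_rate}"
    print(summary)
    return summary


if __name__ == "__main__":
    main()
```

Because every value arrives as a command-line argument, a different run only needs a different command, for example `--epochs 3`, and the code stays unchanged.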
Retries and Timeout
Set retries and a timeout for your job in case it gets stuck or fails.
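To make the two settings concrete, here is a minimal local sketch of the general idea: a timeout ends a run that is stuck, and retries start the run again after a failure. It uses only Python's standard `subprocess` module. The function name `run_with_retries` and its parameters are hypothetical, and this is not how TrueFoundry implements either setting.

```python
import subprocess
import sys


def run_with_retries(command, retries=2, timeout_seconds=5.0):
    """Run `command`, retrying on failure or timeout.

    Returns the number of attempts used on success, or None if every
    attempt failed. `retries` counts the extra attempts after the first.
    """
    for attempt in range(1, retries + 2):
        try:
            result = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # The run got stuck: subprocess.run kills the child process,
            # and the loop moves on to the next attempt.
            continue
        if result.returncode == 0:
            return attempt
    return None


if __name__ == "__main__":
    # A command that succeeds immediately, so one attempt is enough.
    attempts = run_with_retries([sys.executable, "-c", "print('done')"])
    print(f"succeeded after {attempts} attempt(s)")
```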
Concurrency
Set a concurrency limit for your job to specify how many instances of it can run at once.
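The effect of a concurrency limit can be pictured with a small simulation: several runs are started together, but a semaphore lets only a fixed number execute at the same moment. The function `run_with_limit` and its arguments are hypothetical, and the threads only stand in for job runs. The sketch illustrates the concept and says nothing about how TrueFoundry enforces the limit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(num_runs, limit):
    """Start `num_runs` simulated runs, allowing at most `limit` at once.

    Returns the highest number of runs observed executing simultaneously.
    """
    slots = threading.Semaphore(limit)
    lock = threading.Lock()
    active = 0
    peak = 0

    def one_run():
        nonlocal active, peak
        with slots:  # wait here until a slot is free
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for the real work
            with lock:
                active -= 1

    with ThreadPoolExecutor(max_workers=num_runs) as pool:
        futures = [pool.submit(one_run) for _ in range(num_runs)]
        for future in futures:
            future.result()  # re-raise any error from a run
    return peak
```

Calling `run_with_limit(6, 2)` starts six simulated runs, and the returned peak never exceeds the limit of two.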
Access Cloud Services
Access S3, GCS, Azure Container, or other cloud-managed services.
Mounting Volumes
Mount volumes to cache data.
Deploy Programmatically
Deploy using Python or the CLI.
Interact with your job
Interact with your job
Set Up CI/CD
Set up deployments with your favorite CI/CD tool.
View Metrics
View the most important metrics for your job run.
View Logs
View the logs for a job run or for each pod.
Set Up Alerts
Set up alerts and notifications for your job.
Clone, Update, and Rollback
Clone a job, update its version, roll back to a previous version, or promote it to production.