Adding Retries And Handling Failures
TrueFoundry workflows provide robust mechanisms for handling task failures and retrying failed tasks. Here’s how you can implement these features:
Task Retries
You can configure automatic retries for an individual task by setting the retries parameter in the @task decorator, for example retries=3.
With retries=3, a failing task is re-executed up to 3 additional times, so it runs at most 4 times in total before it is marked as failed.
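The retry count described above can be illustrated with a small, plain-Python sketch. It does not use the TrueFoundry SDK: simulated_task is a hypothetical stand-in decorator written only for this illustration, and the real @task decorator may behave differently in details such as delays or logging. The sketch only shows that retries=3 means one initial attempt plus up to 3 more.

```python
import functools


def simulated_task(retries=0):
    """Hypothetical stand-in for a task decorator (NOT the TrueFoundry API).

    Runs the wrapped function once, then up to `retries` more times
    if it raises an exception.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            # 1 initial attempt + `retries` additional attempts
            for _ in range(1 + retries):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:  # a "user" failure: the code raised
                    last_error = exc
            # All attempts failed: surface the last error to the caller
            raise last_error
        return wrapper
    return decorator


attempts = []  # records every execution of the flaky task


@simulated_task(retries=3)
def flaky_task():
    attempts.append(len(attempts) + 1)
    if len(attempts) < 3:
        raise RuntimeError("transient code error")
    return "ok"


always_failing_runs = []  # records every execution of the failing task


@simulated_task(retries=3)
def always_failing_task():
    always_failing_runs.append(1)
    raise ValueError("permanent code error")


result = flaky_task()  # succeeds on the third attempt

try:
    always_failing_task()
except ValueError:
    pass  # raised only after the initial attempt and all 3 retries failed
```

In this sketch the flaky task stops retrying as soon as it succeeds, while the task that always fails is executed 4 times before the error is raised.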
Note: These retries are user retries. They apply only when the task fails because of an error in its code.
Infrastructure problems, such as spot interruptions or a task being OOM killed, are treated as infra failures. Retries for these are controlled by a separate, cluster-level setting called max-node-retries-system-failures, whose default value is 3. Please contact your system admin to change this value.
Workflow Failure Handling
To handle failures at the workflow level, define a failure handler task and pass it to the on_failure parameter of the @workflow decorator, for example on_failure=handle_failure.
If the workflow fails, the handle_failure task runs at the end of the execution. You can use it to clean up resources, database entries, or files, and to send alert notifications.
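The failure-handler behaviour can be sketched in plain Python as well. Everything here is an assumption made for illustration: run_workflow is a hypothetical runner, not part of the TrueFoundry SDK, and the step names and the handler's error argument are invented. The sketch only shows the control flow described above: when a step fails, the remaining steps are skipped and the handler runs last.

```python
events = []  # ordered log of what the simulated workflow did


def run_workflow(steps, on_failure=None):
    """Hypothetical runner (NOT the TrueFoundry API).

    Executes steps in order. If any step raises, the remaining steps
    are skipped and the on_failure handler, if given, is called.
    """
    try:
        for step in steps:
            step()
    except Exception as exc:
        if on_failure is not None:
            on_failure(exc)
        return "FAILED"
    return "SUCCEEDED"


def create_temp_table():
    events.append("created temp table")


def load_data():
    raise RuntimeError("load failed")


def publish_results():
    events.append("published results")


def handle_failure(error):
    # Clean up what earlier steps left behind, then notify someone
    events.append("dropped temp table")
    events.append(f"alert sent: {error}")


status = run_workflow(
    [create_temp_table, load_data, publish_results],
    on_failure=handle_failure,
)
```

Keeping cleanup and alerting in one handler means individual tasks do not each need their own error-handling code for workflow-wide concerns.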