Adding retries and handling failures
TrueFoundry workflows provide robust mechanisms for handling task failures and retrying failed tasks. Here's how you can implement these features:
Task Retries
You can configure automatic retries for individual tasks using the retries parameter in the @task decorator:
@task(task_config=task_config, retries=3)
def my_task():
    ...
This configuration will attempt to execute the task up to 3 additional times if it fails.
Note: These are user retries, which apply only when the task fails because of an error in its own code.
Infrastructure problems, such as spot interruptions or a container being OOM-killed, are treated as infra failures. Retries for those are controlled by max-node-retries-system-failures, a cluster-level setting with a default value of 3. Contact your system admin to change this value.
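To make the retry count concrete, here is a minimal pure-Python sketch of the behavior described above: one initial attempt plus up to `retries` additional attempts. It does not use or reproduce the TrueFoundry implementation; `run_with_retries` and `flaky_task` are hypothetical names introduced only for illustration, and infrastructure failures are not modeled.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T], retries: int) -> T:
    """Hypothetical helper: call fn once, then up to `retries` more times if it raises."""
    attempts_allowed = 1 + retries  # the initial attempt plus the retries
    for attempt in range(1, attempts_allowed + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts_allowed:
                raise
    raise AssertionError("unreachable when retries >= 0")


calls = {"count": 0}


def flaky_task() -> str:
    """Simulated task that fails with a code error on its first two attempts."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated code error")
    return "done"


result = run_with_retries(flaky_task, retries=3)
print(result, calls["count"])  # prints "done 3"
```

With retries=3, a task that never succeeds would be attempted 4 times in total before the failure is reported.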
Workflow Failure Handling
To handle failures at the workflow level, you can define a failure handler task and specify it using the on_failure parameter in the @workflow decorator:
from truefoundry.workflow import task, workflow

@task(task_config=task_config)
def handle_failure():
    print("Handling Failure/Sending Notification")
    ...

@workflow(on_failure=handle_failure)
def data_pipeline():
    ...
If your workflow fails, the handle_failure task runs after the failure, as the final step of the run. You can use it to clean up resources, database entries, or files, and to send an alert notification.
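The ordering described above can be sketched in plain Python. This is only a simulation of the control flow, not the TrueFoundry API: `run_workflow`, `failing_pipeline`, and `healthy_pipeline` are hypothetical names, and the sketch assumes the run is still reported as failed after the handler finishes.

```python
from typing import Callable, List

events: List[str] = []


def run_workflow(body: Callable[[], None], on_failure: Callable[[], None]) -> None:
    """Hypothetical runner: execute the workflow body and call on_failure if it raises."""
    try:
        body()
    except Exception:
        on_failure()  # clean-up and alerting happen after the failure
        raise  # assumption: the run is still marked as failed


def handle_failure() -> None:
    events.append("handler: clean up and notify")


def failing_pipeline() -> None:
    events.append("pipeline: step 1")
    raise RuntimeError("simulated step failure")


def healthy_pipeline() -> None:
    events.append("pipeline: ok")


try:
    run_workflow(failing_pipeline, handle_failure)
except RuntimeError:
    events.append("run: failed")

run_workflow(healthy_pipeline, handle_failure)  # handler is not invoked here
print(events)
```

The handler appears in `events` only for the failing run, and only after the step that raised.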