TrueFoundry lets you set up alerts for all types of applications (Services, Async Services, Jobs, Helm Deployments, Volumes, and Notebooks/SSH) using Prometheus expressions. Alerts can notify you via Email or SlackBot when certain conditions are met, helping you monitor and respond to issues proactively.

Prerequisites

Before setting up alerts, ensure you have configured your notification providers:

  1. Email: Set up SMTP credentials on the Integrations page, under Custom Provider Account. Configure the mail server's SMTP credentials and the from_email address used to send notifications.
  2. SlackBot: Create a bot token for your Slack workspace and add it as an integration under Slack Provider Account (Slack Bot Integration). The token requires the chat:write and chat:write.public scopes. You can then add the Slack channels that should receive notifications.

For more information about how to configure notification channels, please refer to Adding Notification Integrations.
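
Before adding a bot token as an integration, it can help to confirm that the token is able to post to the target channel. The following is a minimal sketch that uses Slack's public chat.postMessage Web API directly; it is not part of TrueFoundry. The helper names, the channel name, and the message text are hypothetical.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_test_message(bot_token, channel, text="Alert channel test"):
    """Build (but do not send) a chat.postMessage request for the bot token."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {bot_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def explain_slack_reply(reply_body):
    """Summarize Slack's JSON reply: 'ok' on success, otherwise the error code."""
    reply = json.loads(reply_body)
    if reply.get("ok"):
        return "ok"
    return f"failed: {reply.get('error', 'unknown')}"


# To actually send the test message (requires network access and a real token):
#   request = build_test_message("<your bot token>", "#alerts")
#   with urllib.request.urlopen(request) as response:
#       print(explain_slack_reply(response.read()))
```

A reply such as "failed: missing_scope" or "failed: not_in_channel" usually points at the scopes or channel membership rather than at the alert configuration.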


Creating Alerts

1. Configure Notification Channels

Before creating alert rules, add one or more notification channels to send alerts to.

There are two types of notification channels:

  • Email: Select the email provider integration fqn from the dropdown and enter the email addresses to send alerts to.
  • SlackBot: Select the SlackBot integration fqn from the dropdown and enter the Slack channel names to send alerts to.

Click Add Targets to configure multiple notification channels.

2. Define Alert Rule

You can create alerts for any application type (Service, Job, Helm, Volume, etc.) using the Create Alert Rule form in the UI.

Fill out the alert rule form:

  • Name: A descriptive name for your alert.

  • Description: (Optional) Briefly describe what this alert monitors.

  • Prometheus Expression: Enter the Prometheus query that defines the alert condition. For example:

    sum(rate(http_requests_total{status!="2xx"}[5m])) by (service) > 5

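    This triggers when the per-second rate of non-2xx HTTP responses, averaged over the last 5 minutes, exceeds 5 for any service. The matcher status!="2xx" assumes the status label stores the response class; if your metric stores exact codes such as 200 or 404, a regex matcher like status!~"2.." is needed instead.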

  • Trigger After (seconds): How long the condition must be true before triggering the alert.

  • Severity: Choose between Warning and Critical.

  • Notification Enabled: Enable or disable notifications for this rule.
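
Because rate() returns a per-second value, the threshold in the example expression above is compared against errors per second, not against a raw count. The sketch below reproduces that arithmetic with hypothetical counter readings. It is a simplification: the real rate() function also handles counter resets and extrapolates to the window boundaries.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Approximate PromQL rate(): counter increase divided by the window length."""
    return (last_value - first_value) / window_seconds


# Hypothetical readings of a non-2xx response counter, taken 5 minutes apart.
window = 5 * 60
threshold = 5  # the "> 5" in the example expression

slow_burn = per_second_rate(1_000, 1_600, window)  # 600 errors in 5 minutes
fast_burn = per_second_rate(1_000, 2_800, window)  # 1800 errors in 5 minutes

print(slow_burn, slow_burn > threshold)  # 2.0 False
print(fast_burn, fast_burn > threshold)  # 6.0 True
```

In other words, with a 5-minute window a threshold of 5 corresponds to more than 1500 non-2xx responses over the window.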


Example: CPU Throttling Alert

Suppose you want to be notified if a container is CPU throttled for more than 25% of the time in the last 5 minutes:

sum(increase(container_cpu_cfs_throttled_periods_total{container="my-app"}[5m])) by (container, namespace) /
sum(increase(container_cpu_cfs_periods_total{container="my-app"}[5m])) by (container, namespace) > (25 / 100)

Set the Trigger After to 300 seconds (5 minutes), and select your notification targets.
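
The expression divides the number of throttled CFS periods by the total number of CFS periods over the same window. A small worked example of that ratio follows; the counter increases are hypothetical and assume the default 100 ms CFS period, which gives 3000 periods in 5 minutes. Returning None for an empty window is a choice made for this sketch only.

```python
def throttled_ratio(throttled_increase, periods_increase):
    """Fraction of CFS periods in which the container was throttled."""
    if periods_increase == 0:
        return None  # no periods elapsed in the window (sketch-only convention)
    return throttled_increase / periods_increase


threshold = 25 / 100

# 900 of 3000 periods throttled: above the 25% threshold.
ratio = throttled_ratio(900, 3000)
print(ratio, ratio > threshold)  # 0.3 True

# 600 of 3000 periods throttled: below the threshold, so no alert.
ratio = throttled_ratio(600, 3000)
print(ratio, ratio > threshold)  # 0.2 False
```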


How Alerts Work

  • When the Prometheus expression evaluates to true for the specified duration, an alert is triggered.
  • Notifications are sent to all configured targets (Email, SlackBot, etc.).
  • You can configure multiple alerts for the same application, each with different conditions.
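
The "true for the specified duration" behavior can be pictured as a small state machine. The sketch below follows the semantics of the "for" clause in standard Prometheus alerting rules, with its inactive, pending, and firing states. That TrueFoundry's Trigger After maps onto exactly this behavior is an assumption, and the evaluation interval used here is hypothetical.

```python
def alert_states(condition_samples, eval_interval_s, trigger_after_s):
    """Return the alert state after each evaluation of the expression.

    condition_samples holds one boolean per evaluation: True when the
    expression returned a result, False otherwise.
    """
    states = []
    true_since = None  # time at which the condition most recently became true
    for index, is_true in enumerate(condition_samples):
        now = index * eval_interval_s
        if not is_true:
            true_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if true_since is None:
            true_since = now
        if now - true_since >= trigger_after_s:
            states.append("firing")
        else:
            states.append("pending")
    return states


samples = [True, True, False, True, True, True, True]
print(alert_states(samples, eval_interval_s=60, trigger_after_s=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```

Note how the single False evaluation resets the timer, so a condition that flaps faster than the Trigger After value never fires.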

Best Practices

  • Use descriptive names and descriptions for your alerts.
  • Test your Prometheus expressions in the Prometheus UI before using them in alerts.
  • Set appropriate severity levels to distinguish between warnings and critical issues.
  • Regularly review and update your alert rules as your application evolves.
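
Besides the Prometheus UI, an expression can be checked from a script through the standard Prometheus HTTP API (the /api/v1/query endpoint). The following is a minimal sketch that assumes you can reach your Prometheus server directly; the base URL in the usage comment and the helper names are hypothetical. It expects an instant-vector result, which is what a comparison expression like the ones above produces.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expression):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expression})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def matching_series(response_body):
    """Return (labels, value) pairs from an instant-vector JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def run_query(base_url, expression, timeout=10):
    """Send the query and return the series that currently match it."""
    url = build_query_url(base_url, expression)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return matching_series(response.read())


# Example usage (requires network access to your Prometheus server):
#   for labels, value in run_query("http://localhost:9090", "up == 0"):
#       print(labels, value)
```

An empty result means the condition is not currently true for any series. That is expected for a healthy system, but it is also what a typo in a metric or label name looks like, so it is worth running the expression without the comparison first.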

Troubleshooting

  • Ensure your notification channels are correctly configured and active.
  • If you do not receive alerts, check the Prometheus expression and ensure it matches your application’s metrics.

With this setup, you can proactively monitor all your TrueFoundry applications and receive timely notifications to keep your systems healthy and reliable.