Alerts let you monitor the health of your applications and get notified when something goes wrong. TrueFoundry makes it easy to set up alerts for all your applications, including Service, Async Service, Job, Helm Deployment, Volume, Notebook, and SSH Server. You can set up notification channels for Email, Slack, or PagerDuty to receive the alerts.

TrueFoundry uses Prometheus Alertmanager to power the alerts.

Key Components of an Alert

An alert primarily comprises three components:

  1. Alert Rule: This is a rule, expressed as a PromQL query, that is evaluated periodically to check whether it’s true. If it’s true for a configured duration, the alert is triggered.

TrueFoundry provides the PromQL expressions for the most commonly used alerts, which should suffice for most use cases, so you don’t necessarily need to learn PromQL to set up alerts.

  2. Severity: This is the severity of the alert, which categorizes whether the alert needs immediate attention. It can be either warning or critical. PagerDuty can use it to route the alert to the proper channel.
  3. Notification Channel: This is the channel where alerts are sent once they are triggered.

Setting up alerts in TrueFoundry Services

To set up alerts, follow the steps below:

1. Set Up Notification Channels

You need to create an integration with Slack, Email, or PagerDuty before setting up a notification channel. If this is not already done, please refer to the Adding Slack, Email, and PagerDuty integrations documentation.

Before setting up alerts, configure one or more notification channels to send them to.

There are three types of notification channels:

  • Email: Select the email provider integration FQN from the dropdown and enter the email addresses to send alerts to.
  • SlackBot: Select the SlackBot integration FQN from the dropdown and enter the Slack channel names to send alerts to.
  • PagerDuty: Select the PagerDuty integration FQN from the dropdown to send alerts to that PagerDuty service.

Click Add Targets to configure multiple notification channels.

2. Define Alert Rule

When you create a new application, TrueFoundry automatically creates some alerts for it. Although these alerts already exist, you have to enable notifications on them before you receive any alert notifications.

Default Alerts

1. HighCPUThrottleRate

Description: This alert is triggered when the container CPU throttling rate exceeds 25% in the last 5 minutes, indicating CPU resource exhaustion.

2. PodIsCrashLooping

Description: This alert is triggered when a pod restarts more than 5 times within a 1-hour period, indicating a crash loop.

3. PodIsNotHealthy

Description: This alert is triggered when a pod is not in a running state for more than 15 minutes. This can happen if the pod is in pending, unknown, or failed status.

4. RequestSuccessRateIsLow

Description: This alert is triggered when the service failure rate (5XX responses) is greater than 5% in the last 5 minutes.
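
The threshold arithmetic behind RequestSuccessRateIsLow can be sketched as follows. This is only an illustration of the 5% comparison described above, not the PromQL expression TrueFoundry actually ships; the function name and the sample counts are hypothetical.

```python
FAILURE_THRESHOLD = 0.05  # 5% of requests, as in the alert description


def failure_ratio(total_requests: int, server_errors: int) -> float:
    """Fraction of requests in the window that returned a 5XX response."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so nothing has failed
    return server_errors / total_requests


# Hypothetical request counts observed over a 5-minute window.
healthy = failure_ratio(total_requests=1000, server_errors=30)    # 3% failures
unhealthy = failure_ratio(total_requests=1000, server_errors=80)  # 8% failures

print(healthy > FAILURE_THRESHOLD)
print(unhealthy > FAILURE_THRESHOLD)
```

Only the second window crosses the 5% threshold, so only that one would satisfy the alert condition.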

Creating your own custom alerts

You can create alerts for any application type (Service, Job, Helm, Volume, etc.) using the Create Alert Rule form in the UI.

Fill out the alert rule form:

  • Name: A descriptive name for your alert.

  • Description: (Optional) Briefly describe what this alert monitors.

  • Prometheus Expression: Enter the Prometheus query that defines the alert condition. For example:

    sum(rate(http_requests_total{status!="2xx"}[5m])) by (service) > 5
    

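    This triggers when the per-second rate of non-2xx HTTP responses, averaged over the last 5 minutes, exceeds 5 for any service. Because rate() returns a per-second average, the threshold is 5 responses per second, not 5 responses in total. The matcher status!="2xx" assumes the status label stores the class 2xx; if it stores full codes such as 200, use a regex matcher such as status!~"2.." instead.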

  • Trigger After (seconds): How long the condition must be true before triggering the alert.

  • Severity: Choose between Warning and Critical.

  • Notification Enabled: Enable or disable notifications for this rule.
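
Because rate() in PromQL reports a per-second average, the "> 5" in the example expression above compares against responses per second rather than a total count. The sketch below works through that arithmetic. It is a simplification: the real rate() function also handles counter resets and extrapolates to the window boundaries, and the helper name and sample values here are hypothetical.

```python
def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window (simplified rate())."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last_value - first_value) / window_seconds


WINDOW_SECONDS = 5 * 60  # the [5m] range in the expression
THRESHOLD = 5            # the "> 5" comparison

# 6 failed responses in 5 minutes: a tiny per-second rate, far below the threshold.
low = per_second_rate(0, 6, WINDOW_SECONDS)
# 1800 failed responses in 5 minutes: 6 per second, above the threshold.
high = per_second_rate(0, 1800, WINDOW_SECONDS)

print(low > THRESHOLD)
print(high > THRESHOLD)
```

If you want to alert on a total count over the window instead, PromQL's increase() function is the usual choice.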


How Alerts Work

  • When the Prometheus expression evaluates to true for the specified duration, an alert is triggered.
  • Notifications are sent to all configured targets (Email, SlackBot, etc.).
  • You can configure multiple alerts for the same application, each with different conditions.
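
The "true for the specified duration" behavior can be sketched as a small state machine. This is a minimal illustration of the Trigger After semantics, assuming that a single false evaluation resets the timer (which is how the for clause of a Prometheus alerting rule behaves); the class name, the 60-second evaluation interval, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    trigger_after_seconds: float
    true_since: Optional[float] = None  # when the condition first became true

    def observe(self, now: float, condition_is_true: bool) -> bool:
        """Record one evaluation and return True if the alert is firing."""
        if not condition_is_true:
            self.true_since = None  # any false evaluation resets the timer
            return False
        if self.true_since is None:
            self.true_since = now
        return now - self.true_since >= self.trigger_after_seconds


state = AlertState(trigger_after_seconds=120)
# (timestamp in seconds, condition result), evaluated every 60 seconds.
evaluations = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
results = [state.observe(t, is_true) for t, is_true in evaluations]
print(results)
```

In this run only the last evaluation fires: the false result at 120 seconds resets the timer, so the condition has to hold from 180 to 300 seconds before the alert triggers.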

Best Practices

  • Use descriptive names and descriptions for your alerts.
  • Test your Prometheus expressions in the Prometheus UI before using them in alerts.
  • Set appropriate severity levels to distinguish between warnings and critical issues.
  • Regularly review and update your alert rules as your application evolves.
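
As an alternative to the Prometheus UI, an expression can be checked from a script through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below assumes you can reach the Prometheus server at some base URL; the URL, helper names, and sample response are hypothetical, and whether your cluster exposes this endpoint depends on your setup.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expression: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expression})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def matching_series(response_body: str) -> list:
    """Return the series for which the expression currently holds."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return payload["data"]["result"]


def run_query(base_url: str, expression: str) -> list:
    """Send the query to Prometheus (requires network access to the server)."""
    url = build_query_url(base_url, expression)
    with urllib.request.urlopen(url, timeout=10) as response:
        return matching_series(response.read().decode("utf-8"))


# Hypothetical response body, parsed here without contacting a server.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"service": "checkout"}, "value": [1700000000, "7.5"]}],
    },
})
for series in matching_series(sample_body):
    print(series["metric"], series["value"][1])
```

A comparison expression such as the "> 5" example returns only the series that satisfy it, so an empty result means the alert condition is not currently met. To query a live server, call run_query with your Prometheus base URL and the expression.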

Troubleshooting

  • Ensure your notification channels are correctly configured and active.
  • If you do not receive alerts, check the Prometheus expression and ensure it matches your application’s metrics.

With this setup, you can proactively monitor all your TrueFoundry applications and receive timely notifications to keep your systems healthy and reliable.