Alerts help you monitor the health of your applications and get notified when something goes wrong. Truefoundry makes it easy to set up alerts for all your applications, including Service, Async Service, Job, Helm Deployment, Volume, Notebook and SSH server. You can set up notification channels in Email, Slack or PagerDuty to be notified of the alerts. Truefoundry uses Prometheus AlertManager to power the alerts.

Key Components of an Alert

An alert primarily comprises three components:
  1. Alert Rule: A rule expressed as a PromQL query that is evaluated periodically to check whether it is true. If it stays true for a configured duration, the alert is triggered. Truefoundry provides the PromQL expressions for the most commonly used alerts, which should suffice for most use cases, so you don’t necessarily need to learn PromQL to set up alerts.
  2. Severity: The severity of the alert, used to categorize whether it needs immediate attention. It can be either warning or critical, and can be used in PagerDuty to route the alert to the appropriate channel.
  3. Notification Channel: The channels to which the alerts are sent once they are triggered.
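For illustration, here is a minimal sketch of how these components fit together, using the standard Prometheus up metric (a hypothetical rule, not one of the built-in alerts; the job label value is a placeholder):

    up{job="my-service"} == 0

If this expression stays true for the configured duration, the alert fires with the chosen severity and is delivered to the configured notification channels.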

Setting up alerts in Truefoundry Services

To set up alerts, follow the steps below:

1. Setup Notification Channels

You need to create an integration with either Slack, Email or PagerDuty before setting up a notification channel. If this is not already done, please refer to the Adding Slack, Email, and PagerDuty integrations documentation.
Before setting up alerts, you have to configure one or more notification channels to which the alerts will be sent. You can choose any of Email, SlackBot or PagerDuty as the notification channel.
Configure Notification Channels
Email Channel
You can add multiple notification channels to send alerts to.

2. Create Alert

You can choose among the already available alerts or create your own custom alert. In most cases, the existing alerts should suffice for your use case. Here are a few of the alerts already available:
This alert is triggered when the container CPU throttling rate exceeds 25% in the last 5 minutes, indicating that the container is running out of its allotted CPU. This might affect the latency or throughput of the service. The solution is usually to increase the CPU request and limit of the service.
This alert is triggered when a pod restarts more than 5 times within a 1-hour period, indicating a crash loop. You should check the logs and events to understand the root cause of the crash.
This alert is triggered when a pod is not in a running state for more than 15 minutes. This can happen if the pod is in a pending, unknown, or failed status, for example because an instance cannot be provisioned or there is an application error. You can check the logs and events to understand the root cause of the pod not being healthy.
This alert is triggered when requests are failing with 5XX responses. This usually happens because of a bug in the application code or some pods not running successfully.
This alert only takes into account the requests going through the load balancer. This means that if you are calling the service directly using <your-service-name>.svc.cluster.local, those requests will not be counted. The service should be called using its domain URL, which is provided in the Ports section after clicking the Expose button.
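For reference, conditions like the CPU throttling and pod restart alerts above could be expressed in PromQL along the following lines. These are illustrative sketches, not necessarily the exact built-in expressions, and they assume the standard cAdvisor and kube-state-metrics metrics are being scraped; the pod name pattern is a placeholder:

    sum(rate(container_cpu_cfs_throttled_periods_total{pod=~"my-service-.*"}[5m])) by (pod)
      / sum(rate(container_cpu_cfs_periods_total{pod=~"my-service-.*"}[5m])) by (pod) > 0.25

    increase(kube_pod_container_status_restarts_total{pod=~"my-service-.*"}[1h]) > 5

The first expression computes the fraction of CPU periods in which each pod was throttled over the last 5 minutes, and the second counts container restarts per pod over the last hour.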
Create Alert Rule
If you need something apart from the above, you can create your own alert using the New Alert Rule form in the UI. We recommend testing the PromQL query in the Grafana UI to verify it before creating the rule. The key fields to fill in the form are:
  • Name: A descriptive name for your alert.
  • Description: (Optional) Briefly describe what this alert monitors.
  • Prometheus Expression: Enter the Prometheus query that defines the alert condition. For example:
    sum(rate(http_requests_total{status!="2xx"}[5m])) by (service) > 5
    
    This triggers when the per-second rate of non-2xx HTTP responses, averaged over the last 5 minutes, exceeds 5 for any service.
  • Trigger After (seconds): How long the condition must be true before triggering the alert.
  • Severity: Choose between Warning and Critical.
  • Notification Enabled: Enable or disable notifications for this rule.
Create Alert Rule Form
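As an example, a custom alert for pods approaching their memory limit could be configured with values like the following (hypothetical values; the metric names assume cAdvisor and kube-state-metrics are being scraped, and the pod name pattern is a placeholder):

    Name: my-service-high-memory
    Prometheus Expression:
      sum(container_memory_working_set_bytes{pod=~"my-service-.*", container!=""}) by (pod)
        / sum(kube_pod_container_resource_limits{resource="memory", pod=~"my-service-.*"}) by (pod) > 0.9
    Trigger After (seconds): 300
    Severity: Warning
    Notification Enabled: true

This fires when any pod of the service has used more than 90% of its memory limit continuously for 5 minutes.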

Applying AlertRules via YAML in GitOps

You can apply AlertRules via YAML in a GitOps workflow. You can copy the YAML of an alert from the Code icon on the Alerts page.
