Alerts are important for monitoring the health of your applications and getting notified when something goes wrong. Truefoundry makes it easy to set up alerts for all your applications, including Service, Async Service, Job, Helm Deployment, Volume, Notebook, and SSH Server. You can set up notification channels over Email, Slack, or PagerDuty to be notified of the alerts.
Truefoundry uses Prometheus AlertManager to power the alerts.
An alert primarily comprises two components:
PromQL Expression: The Prometheus query that defines the alert condition. If the expression evaluates to true for the configured duration, the alert is triggered. Truefoundry provides the PromQL expressions for the most commonly used alerts, which should suffice for most use cases, so you don't necessarily need to learn PromQL to set up alerts.
Severity: The severity of the alert, either warning or critical. This can be used in PagerDuty to route the alert to the proper channel.
To set up alerts, follow the steps below:
Before setting up alerts, you need to configure one or more notification channels to send alerts to. You can choose Email, SlackBot, or PagerDuty as the notification channel.
You can choose among the already available alerts or create your own custom alert. In most cases, the existing alerts should suffice for your use case. Here are a few of the alerts already available:
HighCPUThrottleRate
This alert is triggered when the container CPU throttling rate exceeds 25% over the last 5 minutes, indicating CPU resource exhaustion. Throttling can degrade the latency or throughput of the service. The usual fix is to increase the CPU request and limit of the service.
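For reference, here is a sketch of the kind of expression behind such an alert, based on the standard cAdvisor throttling metrics (Truefoundry's built-in expression may differ):
```promql
# Fraction of CPU periods in which the container was throttled over the last 5 minutes
sum by (namespace, pod, container) (
  increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
)
/
sum by (namespace, pod, container) (
  increase(container_cpu_cfs_periods_total{container!=""}[5m])
)
> 0.25
```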
PodIsCrashLooping
This alert is triggered when a pod restarts more than 5 times within a 1-hour period, indicating a crash loop. You should check the logs and events to understand the root cause of the crash.
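A sketch of a matching expression using the kube-state-metrics restart counter (the built-in rule may use a slightly different form):
```promql
# Containers that restarted more than 5 times in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 5
```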
PodIsNotHealthy
This alert is triggered when a pod is not in a running state for more than 15 minutes, i.e., it is stuck in a pending, unknown, or failed status. Common causes are an instance that cannot be provisioned or an application error. Check the logs and events to understand the root cause of the pod not being healthy.
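A sketch of a matching expression using kube-state-metrics pod phase data, evaluated with a 15-minute trigger duration (the built-in rule may differ):
```promql
# Pods sitting in a non-running, non-succeeded phase
sum by (namespace, pod) (
  kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}
) > 0
```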
RequestSuccessRateIsLow
This alert is triggered when requests are failing with 5XX responses. This usually happens because of a bug in the application code or some pods not running successfully.
This alert only takes into account requests going through the load balancer. Requests made directly to the service using <your-service-name>.svc.cluster.local are not counted. The service should be called using the domain URL, which is provided in the Ports section after clicking the Expose button.
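Purely as an illustration, a success-rate check of this kind could be written against the load balancer's request metrics. The metric name nginx_ingress_controller_requests below is an assumption and will differ depending on the ingress or gateway in use:
```promql
# Share of 5xx responses per ingress over the last 5 minutes
sum by (ingress) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
/
sum by (ingress) (rate(nginx_ingress_controller_requests[5m]))
> 0.05
```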
If you need something beyond the above, you can create your own alert using the New Alert Rule form in the UI. We recommend testing and verifying the PromQL query in the Grafana UI first. The key fields to fill in the form are:
Name: A descriptive name for your alert.
Description: (Optional) Briefly describe what this alert monitors.
Prometheus Expression: Enter the Prometheus query that defines the alert condition. For example:
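A sketch of such a query; the metric name http_requests_total and its status label are assumptions, so substitute whatever request metric your service actually exposes:
```promql
# More than 5 non-2xx responses in the last 5 minutes, per service
sum by (service) (increase(http_requests_total{status!~"2.."}[5m])) > 5
```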
This triggers if there are more than 5 non-2xx HTTP responses in 5 minutes for any service.
Trigger After (seconds): How long the condition must be true before triggering the alert.
Severity: Choose between Warning and Critical.
Notification Enabled: Enable or disable notifications for this rule.
You can also apply AlertRules as YAML via GitOps. You can copy the YAML from the Code icon on the Alerts page.