No company wants a customer to have a poor experience with its product, especially after it has invested in, built, marketed and released that product. For any product, reliability is a core metric. We are a consumer-facing business with millions of users, so the reliability of our tech services is something we monitor day in and day out, and for this Shaadi relies heavily on Datadog APM and monitoring. Sangam, too, has its metrics on Datadog (DD).
The problem started when we, the new developers, were added to the alert emails that Datadog sends whenever a warning or alert is triggered. This led to a wave of alert emails flowing into my inbox.
Every morning, I would find a long list of monitor alerts crowding my inbox. I found this annoying and wanted to resolve it. I observed that this change gave rise to 3 problems.
Constantly being bombarded with notifications becomes very annoying, and turning them off is not a solution. They were so frequent that they used to spoil our sleep at night. When I first started getting these alerts, I jumped in to check what was happening, only to find that most of them were pointless.
2. Unnecessary alerts
I gathered the courage to go through the long list of alert mails and found that most of them were unnecessary warnings. These warnings were triggered even if the graph only slightly crossed the threshold and then recovered. The existing team was used to this and had a fair idea of which alerts were normal and didn’t need attention, whereas to us this was all new.
The major concern was that important mails sometimes got lost in the vast list of monitor alerts. Some of them could need an immediate response. It would not be long before I found myself in trouble for missing an important mail. Something needed to be done.
We discussed this and agreed that we should consider making our alerting smarter. For Sangam, we have 34 monitors. I went through each one of them and checked their frequency of notifications in the past 2 weeks. This helped me narrow down my area of investigation to just 7 monitors that caused most of the noise.
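The triage step above amounts to a frequency count. Here is a minimal sketch of it, assuming the alert mails have already been reduced to a list of monitor names. The monitor names, the counts and the 80% cut-off are made up for illustration and are not our real data.

```python
from collections import Counter

def noisiest_monitors(notifications, share=0.8):
    """Return the smallest set of monitors that together account for
    at least `share` of all notifications, noisiest first."""
    counts = Counter(notifications)
    total = sum(counts.values())
    selected, covered = [], 0
    for name, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append((name, count))
        covered += count
    return selected

# Hypothetical sample: one entry per alert mail received in the period.
sample = ["hera-errors"] * 6 + ["hera-latency"] * 3 + ["disk-usage"] * 1
print(noisiest_monitors(sample))  # [('hera-errors', 6), ('hera-latency', 3)]
```

Sorting by count first means the shortlist is always the handful of monitors worth investigating before anything else.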
Now let’s dive into how the infra alerts that caused these issues were implemented in Datadog.
As a running example, we’ll use the error monitor of hera, our core service, which serves most of our back-end APIs.
For the hera errors monitor, the warning threshold was 100 errors. As soon as the error count went above 100, an alert was triggered. So why couldn’t we simply raise the threshold value?
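To see why a static threshold is noisy, here is a deliberately simplified sketch of the behaviour described above. Only the threshold of 100 comes from our monitor. The per-minute error counts are invented, and a real Datadog monitor aggregates over an evaluation window rather than looking at single points.

```python
WARNING_THRESHOLD = 100  # warning threshold of the hera errors monitor

def threshold_notifications(error_counts, threshold=WARNING_THRESHOLD):
    """Return the indices at which a static threshold check would notify:
    each time the series moves from at-or-below the threshold to above it."""
    notifications = []
    above = False
    for i, count in enumerate(error_counts):
        if count > threshold and not above:
            notifications.append(i)
        above = count > threshold
    return notifications

# Hypothetical per-minute error counts: two one-minute blips that recover on their own.
counts = [80, 95, 104, 90, 85, 101, 88, 70]
print(threshold_notifications(counts))  # [2, 5]
```

Both blips barely cross the line and recover within a minute, yet each one produces a notification.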
- Anomaly detection is an algorithmic feature that identifies when a metric is behaving differently than it has in the past, taking into account trends, seasonal day-of-week, and time-of-day patterns.
- It is well-suited for metrics with strong trends and recurring patterns that are hard to monitor with threshold-based alerting.
- Anomaly alerts calculate an expected range of values for a series, based on its past, using one of three algorithms: basic, agile or robust.
- Some of the anomaly algorithms use the time-of-day and day-of-week to determine the expected range, thus capturing abnormalities that could not be detected by a simple threshold alert.
- On each alert evaluation, Datadog calculates the percentage of the series that falls above, below, or outside (as configured) of the expected range.
- An alert is triggered when this percentage crosses the configured threshold.
- Basic: Used when metrics have no repeating seasonal pattern. It uses little data and adjusts quickly to changing conditions, but has no knowledge of seasonal behaviours or longer trends.
- Robust: Used when seasonal metrics are expected to be stable, and slow, level shifts are considered anomalies. It is stable and predictions remain constant even through long-lasting anomalies at the expense of taking longer to respond to intended level shifts.
- Agile: Used when metrics are seasonal and expected to shift. The algorithm quickly adjusts to metric level shifts. It incorporates the immediate past into its predictions, allowing quick updates for level shifts at the expense of being less robust to recent, long-lasting anomalies.
- Deviations: The width of the grey band.
- Seasonality: The seasonality (hourly, daily, or weekly) of the cycle for the agile or robust algorithm to analyse the metric.
- Thresholds: The percentage of points that need to be anomalous for alerting, warning, and recovery.
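The evaluation rule and the thresholds parameter can be sketched in a few lines. This is an illustration of the idea, not Datadog's implementation: it assumes the expected range (the grey band) is already given as per-point lower and upper bounds, and the threshold values are placeholders.

```python
def anomalous_fraction(values, lower, upper, direction="above"):
    """Fraction of points that fall outside the expected range.
    `lower` and `upper` hold the band's bounds for each point."""
    if not values:
        return 0.0

    def is_anomalous(v, lo, hi):
        if direction == "above":
            return v > hi
        if direction == "below":
            return v < lo
        return v < lo or v > hi  # any other value: treat as "both"

    hits = sum(is_anomalous(v, lo, hi) for v, lo, hi in zip(values, lower, upper))
    return hits / len(values)

def evaluate(values, lower, upper, alert=0.5, warning=0.3, direction="above"):
    """Map the anomalous fraction of one evaluation window to a monitor state."""
    fraction = anomalous_fraction(values, lower, upper, direction)
    if fraction >= alert:
        return "alert"
    if fraction >= warning:
        return "warning"
    return "ok"

# Hypothetical 10-point window with a flat expected range of 0..100.
lower, upper = [0] * 10, [100] * 10
blip = [90, 104, 90, 85, 80, 88, 92, 95, 70, 60]          # 1 of 10 points above
sustained = [90, 120, 95, 130, 140, 99, 150, 160, 80, 170]  # 6 of 10 points above
print(evaluate(blip, lower, upper))       # ok
print(evaluate(sustained, lower, upper))  # alert
```

A single point above the band no longer pages anyone, while a sustained deviation still does.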
New monitors, based on anomaly detection using both Agile and Robust, were created and the alert notification counts were logged for a period of 5 days. The trigger window was set to 10 minutes. The thresholds for the errors and latency monitors were set to 50% and 80% respectively. This meant that an alert would be triggered only when at least 50% of the error data points, or 80% of the latency data points, in the trigger window fell above the expected range (grey band).
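For reference, this is roughly what such a monitor definition looks like when written out as a payload for Datadog's monitor API. Treat it as a sketch under stated assumptions: the metric query, monitor name, message, deviations value, seasonality, `interval` and the `last_4h` evaluation range are placeholders I picked for illustration. Only the algorithm names, the 10-minute trigger window and the 50% threshold come from the setup described above. The code only builds and prints the definition and does not call Datadog.

```python
import json

def anomaly_monitor(name, metric_query, algorithm, deviations, threshold,
                    trigger_window="last_10m", seasonality="weekly"):
    """Build an anomaly monitor definition as a plain dict.
    `threshold` is the fraction (0 to 1) of points in the trigger window
    that must be anomalous before the monitor alerts."""
    query = (
        f"avg(last_4h):anomalies({metric_query}, '{algorithm}', {deviations}, "
        f"direction='above', alert_window='{trigger_window}', interval=60, "
        f"count_default_zero='true', seasonality='{seasonality}') >= {threshold}"
    )
    return {
        "name": name,
        "type": "query alert",
        "query": query,
        "message": "hera error count is above its expected range.",  # placeholder text
        "options": {
            "thresholds": {"critical": threshold, "critical_recovery": 0},
            "threshold_windows": {
                "trigger_window": trigger_window,
                "recovery_window": trigger_window,  # assumption: same as trigger window
            },
        },
    }

# Hypothetical metric query; substitute the real error metric for hera.
monitor = anomaly_monitor(
    name="[anomaly] hera errors (agile)",
    metric_query="sum:hera.errors{env:prod}.as_count()",
    algorithm="agile",
    deviations=2,
    threshold=0.5,
)
print(json.dumps(monitor, indent=2))
```

Creating one definition per algorithm (`'agile'` and `'robust'`) with otherwise identical settings makes it straightforward to compare their notification counts side by side.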