Power of Datadog’s Anomaly Monitor

By Farhan Shaikh

In Engineering

8 Mar 2022

7 min read

No company wants a customer to have a poor experience with its product, especially after it has invested in, built, marketed and released it. For any product, reliability is a core metric. We are a consumer-facing business with millions of users. The reliability of our tech services is something we monitor day in and day out, and for this Shaadi relies heavily on Datadog APM and monitoring. Sangam, too, has its metrics on Datadog.
The problem started when we, the new developers, were included in the alert emails that Datadog sent whenever a warning or alert was triggered. This led to a wave of alert emails flowing into my inbox.

Every morning, I could see a big list of monitor alerts crowding my inbox. I found this annoying and wanted to resolve it. I observed that this flood of alerts gave rise to three problems.

1. Noise

Being constantly bombarded with notifications is annoying, and turning them off is not a solution. They were so frequent that they disturbed our sleep at night. When I started getting these alerts, I jumped in to check what was happening, only to find that most of them were pointless.

2. Unnecessary alerts

I gathered the courage to go through the long list of alert mails and found that most of them were unnecessary warnings. These warnings were triggered even when the graph only slightly crossed the threshold and then recovered. The existing team was used to this and had a fair idea of which alerts were normal and needed no attention, but to us it was all new.

3. Crowding

The major concern was that important mails sometimes got lost in the vast list of monitor alerts. Some of them could need an immediate response. It would not be long before I found myself in trouble for missing an important mail. Something needed to be done.

We discussed this and agreed that we should consider making our alerting smarter. For Sangam, we have 34 monitors. I went through each one of them and checked their frequency of notifications in the past 2 weeks. This helped me narrow down my area of investigation to just 7 monitors that caused most of the noise.

Now let’s look at how the infra alerts that caused these issues were implemented in Datadog.

All of our monitors were of a metric type and were configured based on some threshold values. On each alert evaluation, Datadog calculates the average/minimum/maximum/sum (as configured) over the selected period and checks if it is above or below the threshold. If the condition is satisfied, an alert is triggered. In simple words, a threshold alert compares the metric values to a static threshold. The keyword here is static.
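
To make the evaluation step concrete, here is a minimal sketch of a static threshold check. It is only an illustration of the idea described above, not Datadog's implementation; the function name, the aggregator table and the sample numbers are all made up for this example.

```python
from statistics import mean

# Aggregations a threshold monitor can apply over the evaluation window.
AGGREGATORS = {"avg": mean, "min": min, "max": max, "sum": sum}


def threshold_alert(values, threshold, aggregator="avg", comparison="above"):
    """Return True when the aggregated window breaches a fixed threshold.

    `values` are the metric points in the evaluation window (hypothetical data).
    """
    aggregated = AGGREGATORS[aggregator](values)
    if comparison == "above":
        return aggregated > threshold
    return aggregated < threshold


# Hypothetical error counts in one evaluation window.
window = [80, 95, 120, 130, 90]

print(threshold_alert(window, threshold=100, aggregator="avg"))  # average is 103
print(threshold_alert(window, threshold=100, aggregator="min"))  # minimum is 80
```

The same window alerts or stays quiet depending on the aggregation, but in both cases the number it is compared against never changes.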

We’ll be using the error monitor of hera, our core service which serves most of our back-end APIs, as an example as we go further.

For the hera errors monitor, the warning threshold was 100 errors. As soon as the error count went above 100, an alert was triggered. So why couldn’t we simply raise the threshold value?

When we observe the graph, we can see that it is not static throughout. Instead, it is periodic, and the values vary throughout the day. Activity rises as soon as the day starts and goes down at night. Comparing these values to a static threshold therefore doesn’t make sense: raising the threshold would silence the daytime noise, but it would also hide a genuine spike at night. The threshold needs to be determined dynamically, higher during the daytime and lower at night.
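
The trade-off can be shown with a small sketch. The hourly baseline, the factor of 2 and the helper names below are assumptions invented for illustration; they are not taken from our monitors.

```python
def normal_errors(hour):
    """Hypothetical baseline: busy between 08:00 and 22:00, quiet otherwise."""
    return 120 if 8 <= hour < 22 else 10


def static_alert(value, threshold):
    """Alert when the value exceeds one fixed threshold, whatever the hour."""
    return value > threshold


def time_aware_alert(value, hour, factor=2.0):
    """Alert when the value exceeds `factor` times the baseline for that hour."""
    return value > factor * normal_errors(hour)


# Ordinary daytime traffic at 14:00 trips the threshold of 100: noise.
print(static_alert(120, threshold=100))

# Raising the threshold to 200 silences the noise...
print(static_alert(120, threshold=200))

# ...but a night-time spike to 80 errors at 03:00 (8x the baseline) is now missed.
print(static_alert(80, threshold=200))

# A threshold that follows the time of day catches the spike and ignores the daytime load.
print(time_aware_alert(80, hour=3))
print(time_aware_alert(120, hour=14))
```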

So what next . . .

This made us think that our alerting should be triggered based on the trend over the course of a day. This is when anomaly detection by Datadog came into the picture.

Anomaly detection is an algorithmic feature that identifies when a metric is behaving differently than it has in the past, taking into account trends, seasonal day-of-week, and time-of-day patterns.
It is well-suited for metrics with strong trends and recurring patterns that are hard to monitor with threshold-based alerting.

For example, anomaly detection can help you discover when your web traffic is unusually low on a weekday afternoon—even though that same level of traffic is normal later in the evening. Or consider a metric measuring the number of logins to your steadily-growing site. Because the number increases daily, any threshold would be quickly outdated, whereas anomaly detection can alert you if there is an unexpected drop—potentially indicating an issue with the login system.

How does it work?

Anomaly alerts calculate an expected range of values for a series based on its past, using one of three algorithms: basic, agile or robust.
Some of the anomaly algorithms use the time-of-day and day-of-week to determine the expected range, thus capturing abnormalities that could not be detected by a simple threshold alert.
On each alert evaluation, Datadog calculates the percentage of the series that falls above, below, or outside (as configured) of the expected range.
An alert is triggered when this percentage crosses the configured threshold.
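
The evaluation described above can be sketched as follows. This assumes the expected range is already available as a lower and an upper bound per point; how Datadog computes those bounds is not shown here, and the function names and sample values are hypothetical.

```python
def anomalous_fraction(values, lower, upper, direction="both"):
    """Fraction of points that fall above, below, or outside the expected range."""
    flags = []
    for value, low, high in zip(values, lower, upper):
        if direction == "above":
            flags.append(value > high)
        elif direction == "below":
            flags.append(value < low)
        else:  # "both": anything outside the band counts
            flags.append(value < low or value > high)
    return sum(flags) / len(flags)


def anomaly_alert(values, lower, upper, threshold, direction="both"):
    """Alert when the anomalous fraction reaches the configured threshold."""
    return anomalous_fraction(values, lower, upper, direction) >= threshold


# Hypothetical window of six points with a flat expected range of 5 to 20.
values = [10, 12, 30, 35, 11, 40]
lower = [5] * 6
upper = [20] * 6

print(anomalous_fraction(values, lower, upper, direction="above"))  # 3 of 6 points
print(anomaly_alert(values, lower, upper, threshold=0.5, direction="above"))
print(anomaly_alert(values, lower, upper, threshold=0.8, direction="above"))
```

Unlike the static check, a single point poking above the band is not enough here: a configured share of the window has to be anomalous.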

Basic vs Agile vs Robust (Anomaly detection algorithms)

Basic: Used when metrics have no repeating seasonal pattern. It uses little data and adjusts quickly to changing conditions, but has no knowledge of seasonal behaviours or longer trends.
Robust: Used when seasonal metrics are expected to be stable, and slow, level shifts are considered anomalies. It is stable and predictions remain constant even through long-lasting anomalies at the expense of taking longer to respond to intended level shifts.
Agile: Used when metrics are seasonal and expected to shift. The algorithm quickly adjusts to metric level shifts. It incorporates the immediate past into its predictions, allowing quick updates for level shifts at the expense of being less robust to recent, long-lasting anomalies.

The grey band shows the prediction made by the algorithm. The alert is triggered when a set percentage of values falls above/below the grey band in a set time frame.

Other parameters

Deviations: The width of the grey band.
Seasonality: The seasonality (hourly, daily, or weekly) of the cycle for the agile or robust algorithm to analyse the metric.
Thresholds: The percentage of points that need to be anomalous for alerting, warning, and recovery.
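
To build some intuition for how deviations and seasonality shape the grey band, here is a deliberately naive sketch. It is not how the agile or robust algorithms work internally; it simply averages past values that share the same position in the cycle and widens the band by a multiple of their spread. The function name and the sample history are made up.

```python
from statistics import mean, pstdev


def expected_band(history, period, deviations):
    """Return one (lower, upper) pair per slot of the seasonal cycle.

    history:    evenly spaced past values, oldest first (hypothetical data)
    period:     number of samples in one season (e.g. samples per day)
    deviations: band half-width, in standard deviations of the past values
    """
    band = []
    for slot in range(period):
        samples = history[slot::period]  # same slot across all past cycles
        centre = mean(samples)
        spread = pstdev(samples)
        band.append((centre - deviations * spread, centre + deviations * spread))
    return band


# Three past cycles of four samples each: quiet, rising, peak, falling.
history = [10, 50, 100, 20,
           12, 54, 96, 22,
           8, 46, 104, 18]

for low, high in expected_band(history, period=4, deviations=2):
    print(round(low, 1), round(high, 1))
```

A larger deviations value gives a wider band and fewer anomalous points; the period decides which past values are treated as comparable.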

The Action

New monitors based on anomaly detection, using both the agile and robust algorithms, were created, and their alert notification counts were logged for a period of 5 days. The trigger window was set to 10 minutes. The thresholds for the error and latency monitors were set to 50% and 80% respectively. This meant that an alert would be triggered only when at least 50% of the error data points, or 80% of the latency data points, in the trigger window fell above the expected range (grey band).
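
For readers who manage monitors as code, the configuration above can be expressed as a monitor query string. The sketch below follows the anomalies() query format described in Datadog's anomaly monitor documentation as we understand it, so please verify it against the current docs before using it. The metric name, the 4-hour query window, the deviations value of 2, the interval and the daily seasonality are assumptions for illustration; only the algorithm names, the 10-minute trigger window, the "above" direction and the 50% threshold come from our setup.

```python
ALGORITHMS = {"basic", "agile", "robust"}


def build_anomaly_query(metric_query, algorithm, deviations, threshold,
                        direction="above", alert_window="last_10m",
                        query_window="last_4h", interval=60, seasonality=None):
    """Compose an anomaly monitor query string (hypothetical helper)."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm}")
    if not 0 < threshold <= 1:
        raise ValueError("threshold is a fraction of points, between 0 and 1")

    args = [
        metric_query,
        f"'{algorithm}'",
        str(deviations),
        f"direction='{direction}'",
        f"alert_window='{alert_window}'",
        f"interval={interval}",
        "count_default_zero='true'",
    ]
    # Seasonality only applies to the agile and robust algorithms.
    if seasonality is not None and algorithm != "basic":
        args.append(f"seasonality='{seasonality}'")

    joined = ", ".join(args)
    return f"avg({query_window}):anomalies({joined}) >= {threshold}"


# Hypothetical metric name; 0.5 means 50% of points in the trigger window.
print(build_anomaly_query("sum:hera.errors{*}", "agile", 2, 0.5, seasonality="daily"))
```

The latency variant would differ only in the metric and in a threshold of 0.8.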

The Comparison

By observing the graphs below, we can see that the daily trend of high errors at 10:23 hours is accurately predicted by anomaly detection algorithms.

Similarly, here too the rise in latency is accurately predicted by the agile algorithm, whereas the robust algorithm takes time to adjust to it. Although the latency went high, the robust configuration didn’t trigger an alert notification, as the percentage of points above the grey band was less than 80% (the set threshold value).

The Results

For 6 of the 7 monitors that were taken into consideration, anomaly detection recorded, on average, 98% fewer notifications than regular threshold detection.

With this implementation, it was possible to increase the existing threshold values in older monitors without worrying about missing important alerts at night time. More importantly, we are able to focus on the alerts that really matter. But this is not the end. We will continue to monitor this closely and keep fine-tuning the parameters for an enhanced experience and better reliability.