A Fun Monitoring Story
In my time at a previous job, I was approached by an application team whose Lambda alarm was constantly going off in our alerts Webex channel.
LAMBDA-NAME-Duration-Exceeded
Alarm Description: Lambda Duration > 80
------------------------------------------
Threshold crossed on 2 datapoints: [600, 1400] were >= the threshold [24.0]
First, I checked the CloudWatch alarm and, indeed, the Lambda's datapoints were well over the threshold. Then I went into the CloudWatch logs, and the function's runs weren't erroring or emitting any warning messages.
I then went back to the alarm message above and dissected it:
Lambda Duration > 80
: Ok, 80 what? Seconds, milliseconds, percent?

The threshold [24.0]
: 24 WHAT 😑?

2 datapoints [600, 1400] (...) threshold [24.0]
: Why are the datapoints WAY higher than the threshold to begin with?
Having been thoroughly confused, I decided to dig into the Terraform module that generated this alarm. The module was set up so that a team could define everything for their Lambda in a single module block. One of its inputs was timeout, measured in seconds, and this timeout is what the duration alarm's threshold was built on. The equation for the threshold was the following:
lambda_duration_threshold = lambda_function_timeout * a_user_defined_percent
That shed light on why we were seeing the alarm message. In the above example, 80 meant "80%", and the threshold was just the Lambda's timeout (in this case 30 seconds) * 80% = 24 seconds.
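To make the arithmetic concrete, here's a minimal sketch of how a module like this might wire up the alarm. The variable and resource names are mine for illustration, not the real module's:

```hcl
variable "function_name" {
  description = "Name of the Lambda function being monitored"
  type        = string
}

variable "timeout" {
  description = "Lambda timeout, in seconds"
  type        = number
  default     = 30
}

variable "duration_alert_percent" {
  description = "Alarm when Duration exceeds this percent of the timeout"
  type        = number
  default     = 80
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration" {
  alarm_name          = "${var.function_name}-Duration-Exceeded"
  alarm_description   = "Lambda Duration > ${var.duration_alert_percent}"
  namespace           = "AWS/Lambda"
  metric_name         = "Duration"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # timeout (seconds) * percent: 30 * (80 / 100) = 24 ... which we read as 24 seconds
  threshold = var.timeout * (var.duration_alert_percent / 100)
}
```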
But that didn’t explain how the lambda was running for datapoints way larger than the threshold. Besides, if the timeout is 30 seconds, how is the lambda somehow running for 600s or 1400s?
🙃 it’s because the Duration metric, and therefore the threshold, is in milliseconds! 🙃
So it turned out that our threshold (what we thought was 24 seconds) was actually 24 milliseconds, off by a factor of 1000, and those alarming datapoints of 600 and 1400 were just 600 ms and 1400 ms. In fact, other teams had already figured this out and multiplied their user-defined percentage by 1000. But across the rest of the organization, everyone seemed to be ignoring the alarms in their Webex channels, which was interesting because, between those channels, we were receiving hundreds of these alarms every day. Newcomers subscribed to the channels to get notified when bad things happened in nonprod and prod environments, but eventually muted the chats because they were so noisy.
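Continuing the illustrative sketch above, a unit-correct fix is a one-line change to the threshold expression, rather than folding a magic 1000 into the percentage the way other teams had:

```hcl
  # AWS/Lambda's Duration metric is reported in milliseconds, so convert the
  # timeout (seconds) to milliseconds before applying the percentage:
  # 30 * 1000 * (80 / 100) = 24000 ms, i.e. the 24 seconds we actually intended.
  threshold = var.timeout * 1000 * (var.duration_alert_percent / 100)
```

Doing the conversion where the threshold is computed keeps the percentage input meaning what it says.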
Which brings me to my first lesson learned: always include your units 😄.
The second lesson: alerting is great…in moderation. When it’s too noisy or distracting, though, people become numb to it. I’ve had times in my career where I didn’t know something was broken because I 1) hated getting notifications about things I didn’t care about (and therefore silenced them) and 2) figured someone else would fix it when the time came. However, when leadership asked why I didn’t act sooner on the causes of an outage, those excuses didn’t cut it. If you’re alerting on irrelevant metrics, expect the alerts to be ignored. If you’re over-alerting, expect inaction.