Monitor Duty - Part 1 - My Eyes Glazed Over

Posted on November 6, 2024 by Michael Keane Galloway

My team has too many email alerts. I imagine this happens to a lot of teams. As we build systems, we add new alerts so that we can monitor for issues. With the passage of time, some of these alerts become unnecessary, and others, in hindsight, weren't well thought out in the first place.

I’ve started spending some time looking at our alerts to determine how to improve our monitoring capability. Ideally, I’d like to find unnecessary alerts to remove, figure out how to make our existing alerts better, repair items that are currently getting overlooked because of the poor signal-to-noise ratio, and put in some run books/checklists for how to handle these alerts.

My first significant challenge was to get a baseline look at the alerts coming into our inboxes. I started out trying to just comb through the emails that I’d filtered into my automation inbox, but it didn’t take long for my eyes to glaze over. Seeing multiple low-value alerts in a row and attempting to deduplicate them by manually keeping a spreadsheet is a recipe for boredom and disaster. So I did what I often do: I wrote a PowerShell script:

# Make sure Outlook is running before attaching to it via COM
$outlookProc = Get-Process | ? { $_.Name -eq "OUTLOOK" }
if ($null -eq $outlookProc) {
    Start-Process outlook.exe
}

$uniqueMessages = [System.Collections.Generic.Dictionary[string,psobject]]::new()

$outlook = New-Object -ComObject Outlook.Application
$namespace = $outlook.GetNameSpace("MAPI")
$inboxFolders = $namespace.Folders | ? { <# code to filter through my specific inbox layout #> }

# Pull up to the first 1,000 items from each of those folders
$inboxes = $inboxFolders | % { $_.Items | Select-Object -First 1000 }

# Deduplicate on sender + recipient + subject
$inboxes | % {
    $messageKey = "$($_.SenderEmailAddress), $($_.To), $($_.Subject)"
    if (-not $uniqueMessages.ContainsKey($messageKey)) {
        $uniqueMessages.Add($messageKey, $_)
    }
}

# Emit one CSV row per unique alert
$uniqueMessages.Values | % { [pscustomobject]@{ Subject = $_.Subject; Sender = $_.SenderEmailAddress; To = $_.To } } | ConvertTo-Csv

The above code listing shows the script that I wrote to scan my inbox. The output was a CSV with one row per unique combination of subject, sender email address, and recipient email address. The subject gave me the broad strokes of what we were working with and provided search terms so I could filter for just those emails to examine. The sender address provided clues about the source of each alert: was it from DataDog, SQL Server, a cloud alert, or some other monitoring service? And finally, the To address would usually tell me which distribution list, and therefore which team members, received the alert.
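A natural extension of the same dictionary trick is to tally how many times each key shows up instead of keeping only the first message per key; the counts make the noisiest alerts obvious at a glance. Here's a rough sketch of what that could look like, reusing the $inboxes variable and the same key format from the script above (not something I've actually run as part of this exercise):

# Rough sketch: count how often each (sender, to, subject) key appears
$alertCounts = [System.Collections.Generic.Dictionary[string,psobject]]::new()

$inboxes | % {
    $messageKey = "$($_.SenderEmailAddress), $($_.To), $($_.Subject)"
    if ($alertCounts.ContainsKey($messageKey)) {
        $alertCounts[$messageKey].Occurrences++
    } else {
        $alertCounts.Add($messageKey, [pscustomobject]@{
            Subject     = $_.Subject
            Sender      = $_.SenderEmailAddress
            To          = $_.To
            Occurrences = 1
        })
    }
}

# Sort so the loudest alerts bubble to the top of the CSV
$alertCounts.Values | Sort-Object Occurrences -Descending | ConvertTo-Csv -NoTypeInformation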

After compiling this information, I not only had my baseline understanding of the alerts we’re getting flooded with, but I was also able to pick out certain patterns. With those patterns in hand, I put together some tasks for my teammates to get some easy wins on reducing the noise from our alerts. I also started putting together a plan to spend some sprints doing monitor duty on our systems so I can develop the aforementioned run books/checklists by tackling problems as they arise. At the time of writing, I’ve already made some gains by strategically muting alerts and identifying some potential improvements in our monitoring. I’ll hopefully get a chance to follow up this post with some lessons learned as I finish a few sprints of monitor duty.