Monitoring and Alerting on Azure Service Bus

Ensuring our real-time payment systems have a market-leading level of reliability is one of our key missions at ClearBank Technology. A big part of how we achieve this is a robust monitoring and alerting strategy, which allows us to react quickly to issues in our system before they become issues for our customers.

In this post we’ll talk about how to set up monitoring and alerting on Azure Service Bus (ASB), including how to use infrastructure as code (specifically Terraform) to deploy alerts in a safe and repeatable way.

At ClearBank we make heavy use of ASB. Every day we rely on it to send millions of messages across our platform, fulfilling a range of use cases including: 

  • Implementing the publisher/subscriber pattern our event-driven architecture relies on. 
  • Distributing batch processing jobs across many parallel workers. 
  • Handling backpressure caused by spikes in traffic to our APIs. 

Since we use ASB in so many critical parts of our system it’s important for us to have the right level of monitoring and alerting to detect when things go wrong. 

What could go wrong? 

Lots of things! But to keep things simple we’ll just talk about two categories of problem: 

  • Dead-letter messages: when a service can’t process a message after a certain number of attempts the message will go onto a special queue known as the dead-letter queue, staying there until someone investigates. 
  • Too many active messages: when the number of active messages on a queue gets too high it’s usually an indication something has gone wrong; either a service is struggling to keep up with the load we’re putting through it or the service has gone down completely. 

Azure Monitor 

Fortunately, Azure Monitor gives us (almost) everything we need to set up alerts for ASB. To start off with, we’ll use the Azure portal to set up an alert for our ASB namespace.

An alert for ASB is made up of two parts: an action group, which defines who should be alerted and how, and an alert rule, which defines what condition should trigger the alert. We’ll create the action group first. To do this, navigate to your ASB namespace in the Azure portal, select “Alerts” from the side menu, then “Manage Actions”, then “Add action group”.

Once you’ve named the action group, select the “Actions” tab to set up what should happen when this action group is triggered.

Triggering a PagerDuty Alert

There are many ways to notify someone that something has gone wrong, but in this example we’ll use PagerDuty since it’s a commonly used tool and it’s what we use at ClearBank.

If you don’t already have one, create an Azure integration for the PagerDuty service you’re going to alert (Services > Integrations > Add new integration). Once the integration exists, select it to get a copy of the integration URL. Now, back in the Azure portal, set the action type to “Webhook” and paste the PagerDuty integration URL into the URI field for the webhook. The action group is now configured to send a webhook to PagerDuty every time the alert is triggered.

Alert Rules

The last step is to set up the rules that cause an alert to be triggered. Back at the main Alerts section for your ASB namespace, select “New alert rule”. Select “Add condition” and then “Count of dead-lettered messages in a Queue/Topic”. Under “Split by dimensions” you can select the specific queue or topic you want to alert on (note you can’t select an individual subscription to a topic – more on this later). Under “Alert logic” you can set up the rules for when to trigger the alert. We’re going to have the alert trigger whenever there are more than zero dead-letters in our selected queue or topic, since even a single dead-letter could mean a payment hasn’t been processed and we need to investigate.

We’ll also repeat the above steps to add a rule to alert when the active message count is greater than 1000, since this can indicate a problem with our service’s ability to keep up with the load we’re putting through it.

Finally, set the action group for the alert rule to the action group you created in the previous step, so that PagerDuty is notified when the rule is triggered.

Automating it with Terraform

Using the Azure portal is a helpful way to visualise the different components that make up an alert, but having to execute those steps manually for each alert we want to create is time-consuming and error-prone. Instead, we treat our alerts the same as any other piece of infrastructure and create them using Terraform. All the benefits we get from Terraform, and how it fits into our release pipeline, are details for another blog post; for now let’s just look at how the two main alert resources (action groups and alert rules) can be created with Terraform:

Action Group

resource "azurerm_monitor_action_group" "pager_duty_action_group" {
    name                = "pager-duty-action-group"
    resource_group_name = azurerm_resource_group.example.name
    short_name          = "pdact"

    webhook_receiver {
        name        = "Pager Duty"
        service_uri = "https://events.pagerduty.com/integration/example/enqueue"
    }
}

As you can see, the action group resource is pretty simple: you just give it a name and the URL of the PagerDuty integration that will be called when any rules linked to the action group are triggered. The above code assumes you already created a resource group in Terraform named “example” that the action group will be added to.

Alert Rule

resource "azurerm_monitor_metric_alert" "example_service_bus_alert" {
    name                = "example-asb-alert"
    resource_group_name = azurerm_resource_group.example.name
    scopes              = [azurerm_servicebus_namespace.example.id]
    frequency           = "PT1M"
    description         = "Action will be triggered when dead-letter count is greater than 0, dead-letter count evaluated every 1 minute."
  
    criteria {
        metric_namespace = "Microsoft.ServiceBus/namespaces"
        metric_name      = "DeadLetteredMessages"
        aggregation      = "Average"
        operator         = "GreaterThan"
        threshold        = 0
        dimension {
            name     = "EntityName"
            operator = "Include"
            values   = ["example-asb-queue"]
        }
    }

    action {
        action_group_id = azurerm_monitor_action_group.pager_duty_action_group.id
    }
}

There’s more going on here, but most of the properties are self-explanatory and map back to the example we did in the UI. The “scopes” property assumes you already created an ASB namespace named “example” using Terraform; if your ASB namespace exists outside of Terraform you can reference it by passing a string in the format: “/subscriptions/{your_azure_subscription_id}/resourceGroups/{asb_resource_group_name}/providers/Microsoft.ServiceBus/namespaces/{asb_namespace_name}”. The action property links the alert rule to the action group we created above.
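
For completeness, the second rule from the portal walkthrough (alerting when there are more than 1000 active messages) looks almost identical. This sketch assumes the same example namespace, queue and action group as above:

```hcl
resource "azurerm_monitor_metric_alert" "example_active_messages_alert" {
    name                = "example-asb-active-messages-alert"
    resource_group_name = azurerm_resource_group.example.name
    scopes              = [azurerm_servicebus_namespace.example.id]
    frequency           = "PT1M"
    description         = "Action will be triggered when active message count is greater than 1000, evaluated every 1 minute."

    criteria {
        metric_namespace = "Microsoft.ServiceBus/namespaces"
        metric_name      = "ActiveMessages"
        aggregation      = "Average"
        operator         = "GreaterThan"
        threshold        = 1000

        dimension {
            name     = "EntityName"
            operator = "Include"
            values   = ["example-asb-queue"]
        }
    }

    action {
        action_group_id = azurerm_monitor_action_group.pager_duty_action_group.id
    }
}
```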

Topics vs Subscriptions

As mentioned above, Azure Monitor does not allow you to set up alert rules on an individual subscription to a topic. This causes a problem for us since one topic can have several subscriptions, all owned by different teams, making it unclear who to alert. Initially we approached this problem by alerting the team that publishes messages to the topic and having them track down the subscriber that caused the alert. This quickly proved too inefficient, with awkward hand-offs between teams stopping us from responding to problems as quickly as we’d like.

Auto-forwarding to Queues

The first step we took to get around this issue was to use the auto-forwarding feature of ASB. For every new subscription we introduce, we now auto-forward its messages onto a separate queue, which allows us to use the technique described above to alert on that individual queue.
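
In Terraform, auto-forwarding is just one extra property on the subscription resource. A minimal sketch, assuming illustrative topic and queue resources named “example”:

```hcl
resource "azurerm_servicebus_subscription" "example" {
    name               = "example-asb-subscription"
    topic_id           = azurerm_servicebus_topic.example.id
    max_delivery_count = 10

    # Messages delivered to this subscription are immediately
    # forwarded to the team's own queue, which we can alert on
    # using a standard queue-level alert rule.
    forward_to = azurerm_servicebus_queue.example.name
}
```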

This still left us with the problem of all our existing subscriptions we wanted to add alerts to. We felt it would be too disruptive and risky to attempt to add auto-forwarding queues to all our existing subscriptions en masse, so we wanted to come up with a different solution.

Alerting on Subscriptions

We ended up solving the problem by creating an Azure Function that runs on a schedule, pulling metrics from ASB about all our subscriptions and pushing them to Azure App Insights as custom metrics. We can then set up a very similar alert to the ones above based on those custom metrics. You can find the code for the Function App on our GitHub here. This diagram shows the components of the solution:

Setting up the alert on the App Insights metrics is very similar to setting up the alert directly on the ASB namespace: you just change the scopes to point at the App Insights instance you are pushing the metrics to and update the metric namespace to “Azure.ApplicationInsights”, like this:

resource "azurerm_monitor_metric_alert" "example_subscription_alert" {
    name                = "example-subscription-alert"
    resource_group_name = azurerm_resource_group.example.name
    scopes              = [azurerm_application_insights.example.id]
    frequency           = "PT1M"
    description         = "Action will be triggered when dead-letter count is greater than 0 on the subscription, dead-letter count evaluated every 1 minute."

    criteria {
        metric_namespace = "Azure.ApplicationInsights"
        metric_name      = "DeadLetteredMessages"
        aggregation      = "Average"
        operator         = "GreaterThan"
        threshold        = 0
  
        dimension {
            name     = "EntityName"
            operator = "Include"
            values   = ["example-asb-subscription"]
        }
    }

    action {
        action_group_id = azurerm_monitor_action_group.pager_duty_action_group.id
    }
}

And that’s it, we now get alerted any time there is a dead-letter or too many active messages on either an ASB queue or an individual subscription. Hopefully this is helpful to other tech teams who have chosen to make ASB a core part of their platform.

Andrew Gibson

Senior Technology Manager, ClearBank