Notification queue fills with single down AM instance

What did you do?

A DC went down, taking 1 of 3 Alertmanager instances with it and also triggering a large number of alerts.

(There is cross-DC monitoring right now for $reasons. That's being addressed, but it is not directly relevant to this issue.)

What did you expect to see?

Prometheus sends notifications to remaining 2 AM instances, skipping down AM instance.

Not sure if it'd be as straightforward as a queue per AM - I could see odd timing issues with that.

What did you see instead? Under which circumstances?

The notification queue (prometheus_notifications_queue_length) eventually filled and Prometheus started dropping notifications, even though a majority of AM instances were perfectly functional.

Alertmanagers are configured via static_configs.

Commenting out the down AM instance in the config addressed the issue - things have been fine since then.

DC02 was down. Most of the notifications were generated by the Prometheus pair in DC01, and the notification queues on those instances filled.

DC03 was also sending notifications, albeit at a far lower rate, and prometheus_notifications_queue_length on those instances topped out at ~300.
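For anyone wanting to catch this condition before drops start, a rule along the following lines might help. This is only a sketch: the group name, alert name, 80% threshold, and 5m duration are illustrative assumptions and not taken from our setup. It relies on the prometheus_notifications_queue_length and prometheus_notifications_queue_capacity metrics that Prometheus exposes about itself.

```yaml
groups:
  - name: notifier-queue            # hypothetical group name
    rules:
      - alert: NotificationQueueFilling   # hypothetical alert name
        # Fraction of the notifier queue in use; threshold is an assumption.
        expr: prometheus_notifications_queue_length / prometheus_notifications_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Notification queue on {{ $labels.instance }} is over 80% full"
```

Note that such an alert travels through the same notification queue, so it may itself be dropped once the queue is full.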

Environment

System information:

Linux 3.10.0-1127.el7.x86_64 x86_64

Prometheus version:

prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4

Alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4

Prometheus configuration file:
Relevant bit:

alerting:
  alert_relabel_configs:
    - regex: 'replica'
      action: labeldrop
  alertmanagers:
    - scheme: https
      path_prefix: alertmanager/
      basic_auth:
        username: prometheus
        password: "xxxxx"
      static_configs:
        - targets:
          - 'prod-am01.dc01'
          - 'prod-am02.dc02'
          - 'prod-am03.dc03'

prod-am02.dc02 was down (along with the rest of DC02). Commenting it out of the config fixed the issue.
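For clarity, the workaround amounted to roughly the following change to the targets list. This is a sketch of the edited fragment only; the other fields under alertmanagers (scheme, path_prefix, basic_auth) and alert_relabel_configs were left as shown above.

```yaml
alerting:
  alertmanagers:
    - static_configs:   # scheme, path_prefix and basic_auth unchanged, omitted here
        - targets:
          - 'prod-am01.dc01'
          # - 'prod-am02.dc02'   # down along with DC02; commented out as a workaround
          - 'prod-am03.dc03'
```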

Alertmanager configuration file:

I don't think it's relevant, but let me know if you need it.

Logs:
Sample from DC01 prom instance:

Jul 24 12:15:26 prod-prom01 prometheus: level=warn ts=2020-07-24T12:15:26.539Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=29
