Description
What did you do?
A DC went down, taking 1 of 3 Alertmanager instances with it and triggering a large number of alerts
(there is cross DC monitoring right now for $reasons - that's being addressed, but is not directly relevant to this issue).
What did you expect to see?
Prometheus sends notifications to the remaining 2 AM instances, skipping the down AM instance.
Not sure if it'd be as straightforward as a queue per AM - I could see odd timing issues with that.
What did you see instead? Under which circumstances?
prometheus_notifications_queue_length eventually hit its limit and Prometheus started dropping notifications, even though a majority of AM instances were perfectly functional.
Alertmanagers are configured via static_configs.
Commenting out the down AM instance in the config addressed the issue - things have been fine since then.
DC02 was down - most of the notifications were generated from the prom pair in DC01 and those notification_queues filled.
DC03 was also sending notifications, albeit at a far lower rate, & prometheus_notifications_queue_length on those instances topped out at ~300.
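For anyone wanting to catch this condition earlier, the queue saturation described above could be watched with alerting rules along these lines. This is only a sketch: the group name, alert names, the 0.8 threshold and the 5m windows are illustrative assumptions, not values from this setup. It relies on the prometheus_notifications_queue_capacity and prometheus_notifications_dropped_total metrics that Prometheus exposes alongside prometheus_notifications_queue_length.

```yaml
groups:
  - name: notifier-queue            # hypothetical group name
    rules:
      # Fires when the notification queue is close to its capacity.
      # The 0.8 ratio is an arbitrary example threshold.
      - alert: NotificationQueueNearlyFull        # hypothetical alert name
        expr: prometheus_notifications_queue_length / prometheus_notifications_queue_capacity > 0.8
        for: 5m
      # Fires when the notifier has started dropping alerts.
      - alert: NotificationsBeingDropped          # hypothetical alert name
        expr: rate(prometheus_notifications_dropped_total[5m]) > 0
        for: 5m
```

Note that these alerts go through the same notification queue, so they may themselves be delayed or dropped once the queue is full; graphing the same expressions on a dashboard is a reasonable complement.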
Environment
- System information:
Linux 3.10.0-1127.el7.x86_64 x86_64
- Prometheus version:
prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4
- Alertmanager version:
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4
- Prometheus configuration file:
Relevant bit:
alerting:
  alert_relabel_configs:
  - regex: 'replica'
    action: labeldrop
  alertmanagers:
  - scheme: https
    path_prefix: alertmanager/
    basic_auth:
      username: prometheus
      password: "xxxxx"
    static_configs:
    - targets:
      - 'prod-am01.dc01'
      - 'prod-am02.dc02'
      - 'prod-am03.dc03'
prod-am02.dc02 was down (along with the rest of DC02) - commenting that out of the config fixed the issue.
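For reference, the workaround amounts to roughly the following change to the target list. Only the static_configs part is shown; the scheme, path_prefix and basic_auth settings are assumed to stay as they are in the full config.

```yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'prod-am01.dc01'
      # - 'prod-am02.dc02'   # down along with the rest of DC02; commented out as a workaround
      - 'prod-am03.dc03'
```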
- Alertmanager configuration file:
Don't think it's relevant, but let me know.
- Logs:
Sample from DC01 prom instance:
Jul 24 12:15:26 prod-prom01 prometheus: level=warn ts=2020-07-24T12:15:26.539Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=29