Description
What did you do?
We have a cluster of Alertmanager instances running in HA mode. We've configured Prometheus to discover the Alertmanager instances using DNS SRV service discovery. In our deployment, the DNS records are served by Consul. When an Alertmanager instance fails or is removed, the SRV record is quickly updated to reflect the new topology. A host running one of the Alertmanager instances failed, and we observed the Prometheus `prometheus_notifications_queue_length` metric grow uncontrollably while `Alert notification queue full, dropping alerts` messages appeared in the log.
Reproduction steps:
- Deploy an Alertmanager cluster in HA mode
- Configure Prometheus to discover Alertmanager using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config with SRV records
- Kill one Alertmanager instance and remove that instance from the SRV record
- Wait until the Prometheus notification queue fills
What did you expect to see?
We expected Prometheus to stop sending alerts to the failed Alertmanager quickly after the DNS record changed. We expected Prometheus to continue sending alerts to the Alertmanagers that remained in the DNS record.
What did you see instead? Under which circumstances?
We experienced a failure on a host running one of the Alertmanager instances. The failed instance became unreachable and was removed from the DNS record within a few seconds. Prometheus continued to attempt to post alerts to the failed Alertmanager until we restarted the process. Because every request hit the context timeout, the notification queue started to grow uncontrollably and we started to see notifications dropped. This is the same behavior described in #7676.
This was unexpected for our configuration because the DNS records changed to reflect the missing Alertmanager very quickly. Upon investigation, we found that `prometheus_sd_discovered_targets{name="notify"}` reflected the new topology almost immediately (which is exactly what we expected). However, `prometheus_notifications_alertmanagers_discovered` never reflected the change until we manually restarted Prometheus. This was unexpected.
We did some investigation and believe we have found the root cause: https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L346 performs a non-blocking write to https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L90, which is an unbuffered channel. For Alertmanager notifications, this channel is read by https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L310, which is effectively a non-blocking read when the notifier is under heavy load (because `n.more` is likely to be readable). This becomes a death spiral: the long timeout gives more time for more alerts to enqueue, which increases the chance that `n.more` will be readable. The spiral only stops once the maximum notification queue depth is reached. In our case this reached a steady state of `n.more` always being readable and `tsets` never getting read. This means that service discovery targets for Alertmanager don't actually propagate to the notifier unless the loops in the manager and the notifier happen to line up perfectly. When an Alertmanager instance has failed, the notifier is expected to spend almost all of its time in `sendAll`, because every attempt to post an alert has to wait for the context to expire, which is much longer than the time it takes to execute the select statement in `Run`. In practice, there's no reason to expect that the notifier will ever pick up new service discovery information in this situation.
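For illustration, here is a minimal, standalone Go sketch of the failure mode. It is not the actual Prometheus code; the channel and signal names merely mirror `syncCh` and `n.more`, and the timings are made up. It shows how a non-blocking send into an unbuffered channel is starved when the receiver's select almost always has another ready case:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered channel standing in for the discovery manager's syncCh.
	syncCh := make(chan string)
	// Work signal standing in for the notifier's n.more; under heavy load
	// it is almost always readable.
	more := make(chan struct{}, 1)
	more <- struct{}{}

	// "Notifier" loop: selects between new targets and pending work.
	go func() {
		for {
			select {
			case targets := <-syncCh:
				fmt.Println("notifier received new targets:", targets)
			case <-more:
				// Stand-in for sendAll blocking until the per-request
				// context times out against the dead Alertmanager.
				time.Sleep(500 * time.Millisecond)
				// The queue is still non-empty, so the signal is re-armed,
				// making this case ready again on the next iteration.
				select {
				case more <- struct{}{}:
				default:
				}
			}
		}
	}()

	// "Discovery manager" loop: non-blocking send; whenever the notifier is
	// busy in the other case, the update is skipped until the next cycle.
	for i := 0; i < 20; i++ {
		select {
		case syncCh <- fmt.Sprintf("topology update #%d", i):
		default:
			fmt.Println("discovery receiver's channel was full so will retry the next cycle")
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```

Running this, essentially every cycle hits the `default` branch: because the send is non-blocking and the channel is unbuffered, an update only gets through if the receiver happens to be parked on the select at the exact instant of the send, which matches the behavior we observed.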
This is a serious problem because it means any Alertmanager instance becomes a single point of failure for the entire Prometheus+Alertmanager deployment. We observed that the rate of notifications to the other Alertmanager instances declined steeply and notifications began getting dropped.
We think this would behave better if `syncCh` was buffered. I'll open a PR with that change shortly.
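To illustrate the direction of the proposed change (a sketch only, not the actual patch): with even a single-slot buffer, the manager's non-blocking send can park an update while the notifier is busy, and the update then waits for the notifier's next pass through its select instead of requiring the two loops to line up exactly.

```go
package main

import "fmt"

// Sketch only, not the actual patch: a non-blocking send into an unbuffered
// channel is dropped whenever no receiver is waiting, while the same send
// into a buffered channel parks the update for later pickup.
func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	trySend := func(ch chan string, v string) bool {
		select {
		case ch <- v:
			return true
		default:
			return false
		}
	}

	// No receiver is waiting on either channel at this point.
	fmt.Println("unbuffered send accepted:", trySend(unbuffered, "new targets")) // false: update lost until the next cycle
	fmt.Println("buffered send accepted:", trySend(buffered, "new targets"))     // true: update waits for the notifier
}
```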
System information
Linux
Prometheus version
2.45.1
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
We saw both these messages repeatedly in the Prometheus logs:
caller=manager.go:246 level=debug component="discovery manager notify" msg="Discovery receiver's channel was full so will retry the next cycle"
caller=notifier.go:371 level=warn component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=3