Description
What did you do?
We have a cluster of Alertmanager instances running in HA mode. We've configured Prometheus to discover the Alertmanager instances using DNS SRV service discovery. In our deployment, the DNS records are served by Consul. When an Alertmanager instance fails or is removed, the SRV record is quickly updated to reflect the new topology. A host running one of the Alertmanager instances failed, and we observed the Prometheus `prometheus_notifications_queue_length` metric grow uncontrollably while `Alert notification queue full, dropping alerts` messages appeared in the log.
Reproduction steps:
- Deploy an Alertmanager cluster in HA mode
- Configure Prometheus to discover Alertmanager using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config with SRV records
- Kill one Alertmanager instance and remove that instance from the SRV record
- Wait until the Prometheus notification queue fills
What did you expect to see?
We expected Prometheus to stop sending alerts to the failed Alertmanager quickly after the DNS record changed. We expected Prometheus to continue sending alerts to the Alertmanagers that remained in the DNS record.
What did you see instead? Under which circumstances?
We experienced a failure on a host running one of the Alertmanager instances. The failed instance became unreachable and was removed from the DNS record within a few seconds. Prometheus continued to attempt to post alerts to the failed Alertmanager until we restarted the process. Because every request hit the context timeout, the notification queue started to grow uncontrollably and we started to see notifications dropped. This is the same behavior described in #7676.
This was unexpected for our configuration because the DNS records changed to reflect the missing Alertmanager very quickly. Upon investigation, we found that `prometheus_sd_discovered_targets{name="notify"}` reflected the new topology almost immediately (which is exactly what we expected). However, `prometheus_notifications_alertmanagers_discovered` never reflected the change until we manually restarted Prometheus. This was unexpected.
We did some investigation and believe we have found the root cause: https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L346 performs a non-blocking write to https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L90, which is an unbuffered channel. For Alertmanager notifications, this channel is read by https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L310, which is effectively a non-blocking read when the notifier is under heavy load (because `n.more` is likely to be readable). This becomes a death spiral: the long timeout gives more time for more alerts to enqueue, which increases the chance that `n.more` will be readable. The spiral only stops once the maximum notification queue depth is reached. In our case this reached a steady state of `n.more` always being readable and `tsets` never getting read. This means that service discovery targets for Alertmanager don't actually propagate to the notifier unless the loops in the manager and the notifier happen to line up perfectly. When an Alertmanager instance has failed, the notifier is expected to spend almost all of its time in `sendAll`, because every attempt to post an alert has to wait for the context to expire, which is much longer than the time it takes to execute the select statement in `Run`. In practice, there's no reason to expect that the notifier will ever pick up new service discovery information in this situation.
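For illustration, here is a minimal, standalone Go sketch of the failure mode. It is not the actual Prometheus code; the channel and signal names merely mirror `syncCh` and `n.more`, and the timings are made up. It shows how a non-blocking send into an unbuffered channel is starved when the receiver's select almost always has another ready case:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered channel standing in for the discovery manager's syncCh.
	syncCh := make(chan string)
	// Work signal standing in for the notifier's n.more; under heavy load
	// it is almost always readable.
	more := make(chan struct{}, 1)
	more <- struct{}{}

	// "Notifier" loop: selects between new targets and pending work.
	go func() {
		for {
			select {
			case targets := <-syncCh:
				fmt.Println("notifier received new targets:", targets)
			case <-more:
				// Stand-in for sendAll blocking until the per-request
				// context times out against the dead Alertmanager.
				time.Sleep(500 * time.Millisecond)
				// The queue is still non-empty, so the signal is re-armed,
				// making this case ready again on the next iteration.
				select {
				case more <- struct{}{}:
				default:
				}
			}
		}
	}()

	// "Discovery manager" loop: non-blocking send; whenever the notifier is
	// busy in the other case, the update is skipped until the next cycle.
	for i := 0; i < 20; i++ {
		select {
		case syncCh <- fmt.Sprintf("topology update #%d", i):
		default:
			fmt.Println("discovery receiver's channel was full so will retry the next cycle")
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```

Running this, essentially every cycle hits the `default` branch: because the send is non-blocking and the channel is unbuffered, an update only gets through if the receiver happens to be parked on the select at the exact instant of the send, which matches the behavior we observed.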
This is a serious problem because it means any Alertmanager instance becomes a single point of failure for the entire Prometheus+Alertmanager deployment. We observed that the rate of notifications to the other Alertmanager instances declined steeply and notifications began getting dropped.
We think this would behave better if `syncCh` was buffered. I'll open a PR with that change shortly.
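To illustrate the direction of the proposed change (a sketch only, not the actual patch): with even a single-slot buffer, the manager's non-blocking send can park an update while the notifier is busy, and the update then waits for the notifier's next pass through its select instead of requiring the two loops to line up exactly.

```go
package main

import "fmt"

// Sketch only, not the actual patch: a non-blocking send into an unbuffered
// channel is dropped whenever no receiver is waiting, while the same send
// into a buffered channel parks the update for later pickup.
func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	trySend := func(ch chan string, v string) bool {
		select {
		case ch <- v:
			return true
		default:
			return false
		}
	}

	// No receiver is waiting on either channel at this point.
	fmt.Println("unbuffered send accepted:", trySend(unbuffered, "new targets")) // false: update lost until the next cycle
	fmt.Println("buffered send accepted:", trySend(buffered, "new targets"))     // true: update waits for the notifier
}
```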
System information
Linux
Prometheus version
2.45.1
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
We saw both these messages repeatedly in the Prometheus logs:
caller=manager.go:246 level=debug component="discovery manager notify" msg="Discovery receiver's channel was full so will retry the next cycle"
caller=notifier.go:371 level=warn component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=3