prometheus_sd_updates_delayed_total firing, resulting in dropped alerts

What did you do?

I'm using pod service discovery for Prometheus' connection to Alertmanager in my setup.

Recently, when an Alertmanager pod was evicted and recreated, Prometheus SD failed to update its set of discovered Alertmanager pod IPs with the recreated pod's IP for ~1 hour. Some pertinent SD metrics during this time:

What did you expect to see?

I expected Prometheus SD to update promptly instead of being delayed and incrementing the prometheus_sd_updates_delayed_total counter.

What did you see instead? Under which circumstances?

Impact: once the Prometheus alert queue filled up, we were unable to send alerts for about 1 hour.
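
For anyone wanting to watch for the same symptom, a rule file along the following lines might help. This is only a sketch under stated assumptions: the group name, alert names, time windows, and severities are illustrative and not taken from any real setup. It relies on the prometheus_sd_updates_delayed_total counter discussed here and on the prometheus_notifications_dropped_total counter that Prometheus exposes for dropped alerts.

```yaml
groups:
- name: prometheus-sd-health                # hypothetical group name
  rules:
  # Service discovery updates are being sent later than expected.
  - alert: PrometheusSDUpdatesDelayed       # hypothetical alert name
    expr: increase(prometheus_sd_updates_delayed_total[10m]) > 0
    for: 10m                                # assumed duration
    labels:
      severity: warning                     # assumed severity
  # Alerts are being dropped instead of delivered to Alertmanager.
  - alert: PrometheusNotificationsDropped   # hypothetical alert name
    expr: increase(prometheus_notifications_dropped_total[5m]) > 0
    for: 5m                                 # assumed duration
    labels:
      severity: critical                    # assumed severity
```

Since the affected Prometheus could not deliver alerts while its discovered Alertmanager set was stale, rules like these would probably need to be evaluated by a separate Prometheus instance to be of any use.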

Some sanity checks I did:

* The Prometheus containers are not getting CPU throttled.
* There is almost no CPU contention at the host level across pods.
* I tried setting GOMAXPROCS equal to the container CPU requests, and also to a number much larger than the container/host CPU count; neither change made a difference.
* Sending a SIGHUP to Prometheus gets it out of this "bad state" and resets SD for the Alertmanager pods.

Notably, in this incident I didn't send a SIGHUP, nor were there any external changes that would cause a reload. Prometheus SD eventually recovered by itself after ~1 hour.

Are there any known/hypothesized reasons why this would happen?

System information

Linux 6.1.58+ x86_64

Prometheus version

Version	2.48.0
Revision	6d80b30990bc297d95b5c844e118c4011fad8054
Branch	HEAD
BuildUser	root@26117804242c
BuildDate	20231116-04:35:21
GoVersion	go1.21.4

Prometheus configuration file

alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - kubernetes_sd_configs:
    - namespaces:
        names:
        - monitoring
      role: pod
      selectors:
      - label: app.kubernetes.io/name=alertmanager
        role: pod
    relabel_configs:
    - action: keep
      regex: web
      source_labels:
      - __meta_kubernetes_pod_container_port_name



### Alertmanager version

```text
Branch:    HEAD
BuildDate: 20230824-11:11:58
BuildUser: root@df8d7debeef4
GoVersion: go1.20.7
Revision:  d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d
Version:   0.26.0
```

Alertmanager configuration file

No response

Logs

* I didn't have debug logging enabled, so there is nothing pertinent in the Prometheus logs (nor in the Alertmanager logs, but this issue seems contained within Prometheus).
