Description
What did you do?
I'm using pod service discovery for Prometheus' connection to Alertmanager in my setup.
Recently, when an Alertmanager pod was evicted and recreated, Prometheus SD took ~1 hour to update its set of discovered Alertmanager pod IPs with the recreated pod's IP. Some pertinent SD metrics during this time:
What did you expect to see?
I expected Prometheus SD to pick up the new pod IP promptly, rather than being delayed and incrementing the prometheus_sd_updates_delayed_total metric.
What did you see instead? Under which circumstances?
Impact: once the Prometheus alert queue filled up, we were unable to send alerts for ~1 hour.
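For reference, a rule along these lines could surface the delay sooner. This is only a sketch: the group name, alert name, 10m window, `for` duration, and severity label are hypothetical, and the only metric taken from this report is prometheus_sd_updates_delayed_total. In this failure mode the affected Prometheus could not have delivered the alert to Alertmanager itself, so the expression would only help on a dashboard or when evaluated by a second Prometheus instance.

```yaml
groups:
  - name: sd-delay-sketch                  # hypothetical group name
    rules:
      - alert: PrometheusSDUpdatesDelayed  # hypothetical alert name
        # Fires if any SD update was delayed during the last 10 minutes.
        # The window and the `for` duration are assumptions, not tuned values.
        expr: increase(prometheus_sd_updates_delayed_total[10m]) > 0
        for: 15m
        labels:
          severity: warning                # assumed label convention
        annotations:
          summary: "Prometheus service discovery updates are being delayed"
```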
Some sanity checks I did:
- the prometheus containers are not getting CPU throttled
- there isn't ~any contention on a host level for CPU across pods
- I tried setting GOMAXPROCS both to the container CPU requests and to a number much larger than the container/host CPU count, and neither change made a difference
- sending a SIGHUP to Prometheus gets it out of this "bad state" and resets SD for the Alertmanager pods
Notably, during this incident I did not send a SIGHUP, and there were no external changes that would have triggered a reload. Prometheus SD eventually recovered by itself after the ~1 hour.
Are there any known/hypothesized reasons why this would happen?
System information
Linux 6.1.58+ x86_64
Prometheus version
Version 2.48.0
Revision 6d80b30990bc297d95b5c844e118c4011fad8054
Branch HEAD
BuildUser root@26117804242c
BuildDate 20231116-04:35:21
GoVersion go1.21.4
Prometheus configuration file
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - kubernetes_sd_configs:
    - namespaces:
        names:
        - monitoring
      role: pod
      selectors:
      - label: app.kubernetes.io/name=alertmanager
        role: pod
    relabel_configs:
    - action: keep
      regex: web
      source_labels:
      - __meta_kubernetes_pod_container_port_name
### Alertmanager version
```text
Branch: HEAD
BuildDate: 20230824-11:11:58
BuildUser: root@df8d7debeef4
GoVersion: go1.20.7
Revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d
Version: 0.26.0
```
Alertmanager configuration file
No response
Logs
* I didn't have debug logs on, so there is nothing pertinent in the Prometheus logs (nor in the Alertmanager logs, though this issue seems contained within Prometheus)