`prometheus_sd_updates_delayed_total` firing, resulting in dropped alerts · Issue #13419 · prometheus/prometheus
prometheus_sd_updates_delayed_total firing, resulting in dropped alerts  #13419
Closed
@davidhwang-anthropic

Description

What did you do?

I'm using pod service discovery for Prometheus' connection to Alertmanager in my setup.

Recently, when an alertmanager pod was evicted and recreated, Prometheus SD failed to update its set of discovered alertmanager pod IPs to include the recreated pod's IP for ~1 hour. Some pertinent SD metrics from this period are included under "What did you see instead?".

What did you expect to see?

I expected prometheus SD to successfully update instead of getting delayed and thus firing the prometheus_sd_updates_delayed_total metric.
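
To make this expectation checkable, here is a minimal Python sketch (not part of the original report) that compares two snapshots of Prometheus's `/metrics` text output and reports how much `prometheus_sd_updates_delayed_total` grew per label set. The sample text, its label values, and the helper names `counter_by_labels` and `counter_increase` are illustrative assumptions. The parser is deliberately simplified and does not handle `}` inside label values.

```python
import re

METRIC = "prometheus_sd_updates_delayed_total"

# One sample line of the text exposition format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)

def counter_by_labels(exposition, metric):
    """Return {label-string: value} for every sample of `metric`."""
    out = {}
    for line in exposition.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = SAMPLE_RE.match(line)
        if m and m.group("name") == metric:
            out[m.group("labels") or ""] = float(m.group("value"))
    return out

def counter_increase(before, after):
    """Per-label-set increase between two snapshots.

    If a value went down, assume the counter was reset and count the new
    value as the increase.
    """
    result = {}
    for labels, value in after.items():
        previous = before.get(labels, 0.0)
        result[labels] = value - previous if value >= previous else value
    return result

# Made-up snapshots for illustration only; not data from this incident.
SAMPLE_BEFORE = """\
# TYPE prometheus_sd_updates_delayed_total counter
prometheus_sd_updates_delayed_total{name="notify"} 3
prometheus_sd_updates_delayed_total{name="scrape"} 0
"""
SAMPLE_AFTER = """\
# TYPE prometheus_sd_updates_delayed_total counter
prometheus_sd_updates_delayed_total{name="notify"} 8
prometheus_sd_updates_delayed_total{name="scrape"} 0
"""

if __name__ == "__main__":
    before = counter_by_labels(SAMPLE_BEFORE, METRIC)
    after = counter_by_labels(SAMPLE_AFTER, METRIC)
    print(counter_increase(before, after))
```

A non-zero increase for the label set that covers Alertmanager discovery would correspond to the delayed updates described in this issue.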

What did you see instead? Under which circumstances?

[Three images, not preserved in this copy; per the description above, these showed SD metrics during the incident.]

Impact: once the Prometheus alert queue filled up, we were unable to send alerts for about 1 hour.

Some sanity checks I did:

  • the prometheus containers are not getting CPU throttled
  • there is almost no CPU contention across pods at the host level
  • I tried setting GOMAXPROCS both to the container's CPU request and to a number much larger than the container/host CPU count; neither change made a difference
  • sending a SIGHUP to Prometheus can get it out of this "bad state" and reset SD for the alertmanager pods

Notably, during this incident I did not send a SIGHUP, nor were there any external changes that would cause a reload. Prometheus SD eventually recovered by itself after the ~1 hour.
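
For reference, the SIGHUP workaround mentioned in the checks above could be scripted roughly as follows. This is only a sketch: `request_reload` is a hypothetical helper name, and it assumes the caller already knows the PID of the Prometheus process and has permission to signal it.

```python
import os
import signal

def request_reload(pid):
    """Send SIGHUP to the process with the given PID.

    Prometheus reloads its configuration on SIGHUP; how the PID is
    obtained (for example inside the container) is left to the caller.
    """
    os.kill(pid, signal.SIGHUP)
```

This is a manual workaround only; it does not explain why SD stalled in the first place.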

Are there any known/hypothesized reasons why this would happen?

System information

Linux 6.1.58+ x86_64

Prometheus version

Version	2.48.0
Revision	6d80b30990bc297d95b5c844e118c4011fad8054
Branch	HEAD
BuildUser	root@26117804242c
BuildDate	20231116-04:35:21
GoVersion	go1.21.4

Prometheus configuration file

alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers:
  - kubernetes_sd_configs:
    - namespaces:
        names:
        - monitoring
      role: pod
      selectors:
      - label: app.kubernetes.io/name=alertmanager
        role: pod
    relabel_configs:
    - action: keep
      regex: web
      source_labels:
      - __meta_kubernetes_pod_container_port_name
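
To illustrate what the `keep` rule in this configuration does, here is a small Python sketch of the relabeling semantics: the values of `source_labels` are joined with a separator (`;` by default) and the target is kept only if the fully anchored regex matches the joined value. The function name `keep` and the non-`web` port names in the usage example are hypothetical. Prometheus uses RE2 regex syntax, so Python's `re` is only a stand-in that behaves the same for a simple pattern like `web`.

```python
import re

PORT_NAME = "__meta_kubernetes_pod_container_port_name"

def keep(labels, source_labels, regex, separator=";"):
    """Return True if a target with `labels` survives a `keep` relabel rule."""
    # Missing labels are treated as empty strings before joining.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex must match the whole joined value, not just a substring.
    return re.fullmatch(regex, value) is not None

if __name__ == "__main__":
    # Hypothetical port names; only "web" comes from the config above.
    for port in ("web", "web-tls", "mesh-tcp"):
        print(port, keep({PORT_NAME: port}, [PORT_NAME], "web"))
```

With the `pod` role, each declared container port yields its own target, so this rule leaves only the targets whose port is named `web`.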


Alertmanager version

Branch	HEAD
BuildDate	20230824-11:11:58
BuildUser	root@df8d7debeef4
GoVersion	go1.20.7
Revision	d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d
Version	0.26.0

Alertmanager configuration file

No response

Logs

I didn't have debug logging enabled, so there is nothing pertinent in the Prometheus logs (nor in the Alertmanager logs, though this issue seems contained within Prometheus).
