Notification queue fills with single down AM instance #7676
Open
@britcey

What did you do?

DC went down, taking 1 of 3 Alertmanager instances with it and also causing a large number of alerts

(there is cross DC monitoring right now for $reasons - that's being addressed, but is not directly relevant to this issue).

What did you expect to see?

Prometheus sends notifications to remaining 2 AM instances, skipping down AM instance.

Not sure if it'd be as straightforward as a queue per AM - I could see odd timing issues with that.
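To make the "queue per AM" idea concrete, here is a minimal sketch of what it could look like. It is illustrative only and is not Prometheus's notifier code: `Alert`, `amQueue`, `fanout`, and the `send` callback are hypothetical names, and the drop-oldest policy and no-retry behaviour are assumptions. Each Alertmanager target gets its own bounded queue, so a target that is never drained only overflows its own queue.

```go
package main

import "sync"

// Alert is a hypothetical stand-in for an alert; only the name is kept.
type Alert struct{ Name string }

// amQueue is a hypothetical bounded queue owned by a single Alertmanager target.
type amQueue struct {
	mu       sync.Mutex
	target   string
	capacity int
	pending  []Alert
	dropped  int
}

// enqueue appends alerts and drops the oldest entries once capacity is exceeded.
func (q *amQueue) enqueue(alerts ...Alert) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, alerts...)
	if over := len(q.pending) - q.capacity; over > 0 {
		q.pending = q.pending[over:]
		q.dropped += over
	}
}

// drain removes up to batch alerts and hands them to send.
// A failed send counts the batch as dropped; there is no retry (an assumption).
func (q *amQueue) drain(batch int, send func(target string, alerts []Alert) error) (int, error) {
	q.mu.Lock()
	n := batch
	if n > len(q.pending) {
		n = len(q.pending)
	}
	out := make([]Alert, n)
	copy(out, q.pending[:n])
	q.pending = q.pending[n:]
	q.mu.Unlock()

	if n == 0 {
		return 0, nil
	}
	if err := send(q.target, out); err != nil {
		q.mu.Lock()
		q.dropped += n
		q.mu.Unlock()
		return 0, err
	}
	return n, nil
}

// fanout copies every alert into each target's own queue.
type fanout struct{ queues []*amQueue }

func newFanout(capacity int, targets ...string) *fanout {
	f := &fanout{}
	for _, t := range targets {
		f.queues = append(f.queues, &amQueue{target: t, capacity: capacity})
	}
	return f
}

func (f *fanout) enqueue(alerts ...Alert) {
	for _, q := range f.queues {
		q.enqueue(alerts...)
	}
}
```

In this sketch each queue would be drained by its own goroutine, so a sender stuck on the down instance would not hold up deliveries to the other two. The timing concern still applies: the instances would receive the same alert at different times, and whether that matters is not something this sketch answers.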

What did you see instead? Under which circumstances?

prometheus_notifications_queue_length eventually filled and it started dropping notifications, even though a majority of AM instances were perfectly functional.

Alertmanagers are configured via static_configs.

Commenting-out the down AM instance from the config addressed the issue - things have been fine since then.

DC02 was down - most of the notifications were generated from the prom pair in DC01 and those notification_queues filled.

DC03 was also sending notifications, albeit at a far lower rate & prometheus_notifications_queue_length on those instances topped out ~ 300.

Environment

  • System information:

Linux 3.10.0-1127.el7.x86_64 x86_64

  • Prometheus version:

prometheus, version 2.19.2 (branch: HEAD, revision: c448ada)
build user: root@dd72efe1549d
build date: 20200626-09:02:20
go version: go1.14.4

  • Alertmanager version:

alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4

  • Prometheus configuration file:
    Relevant bit:
alerting:
  alert_relabel_configs:
    - regex: 'replica'
      action: labeldrop
  alertmanagers:
    - scheme: https
      path_prefix: alertmanager/
      basic_auth:
        username: prometheus
        password: "xxxxx"
      static_configs:
        - targets:
          - 'prod-am01.dc01'
          - 'prod-am02.dc02'
          - 'prod-am03.dc03'

prod-am02.dc02 was down (along with the rest of DC02) - commenting that out of the config fixed the issue.

  • Alertmanager configuration file:

Don't think it's relevant, but let me know.

  • Logs:
    Sample from DC01 prom instance:
Jul 24 12:15:26 prod-prom01 prometheus: level=warn ts=2020-07-24T12:15:26.539Z caller=notifier.go:379 component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=29
