8000 notifier: fix drop alertmanager on apply config step by tommy0 · Pull Request #13119 · prometheus/prometheus · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

notifier: fix drop alertmanager on apply config step #13119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

tommy0
Copy link
Contributor
@tommy0 tommy0 commented Nov 9, 2023

When the config is applied, set of alertmanagers are recreated with an empty list of URLs (ams []alertmanager init alertmanagerSet). Before reload will happen, sendAll will skip the notification batch, because alertmnager list is empty ( skip for ). I think it is necessary to save the list of alertmanagers when applying the config, since it will be updated at the next reload step.

A test case is added to demonstrate issue above.

Fixes #13064

In our cases, we observe that metric prometheus_notifications_dropped_total increases and heartbeat notifications in alertmanager are skipped, this fix repare this.

Signed-off-by: tommy0 <tommy0@mail.ru>
@tommy0 tommy0 force-pushed the fix-drop-alertmanager-on-reload branch from d4838e8 to 7fe64e9 Compare November 9, 2023 16:08
@roidelapluie roidelapluie self-assigned this Nov 12, 2023
@roidelapluie
Copy link
Member

We already have a proper fix for this, which is the command line flag --enable-feature=new-service-discovery-manager.

Can you please try it?

@tommy0
Copy link
Contributor Author
tommy0 commented Nov 14, 2023

@roidelapluie
I posted a comment in issue. Can we discuss about this in issue?

@2nick
Copy link
2nick commented Nov 23, 2023

@roidelapluie any news?

This bug can't be fixed with new service discovery and it looks important enough to me as users lose some alerts just because of config reload.

@roidelapluie
Copy link
Member

I need to continue testing this. I have started already.

@tommy0
Copy link
Contributor Author
tommy0 commented Jan 18, 2024

@roidelapluie Any updates?

@tommy0
Copy link
Contributor Author
tommy0 commented Mar 4, 2024

@roidelapluie sorry for ping

Copy link
Member
@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across this, again.
Thanks for looking into this.
I have some remarks.

@@ -264,6 +264,14 @@ func (n *Manager) ApplyConfig(conf *config.Config) error {
return err
}

// If alertmanagers already exists, don't skip it, in reload step will be refreshed new targets.
oldAmSets, ok := n.alertmanagers[k]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we should rely on k for this, given how it’s constructed:

prometheus/config/config.go

Lines 910 to 917 in 5c41768

// ToMap converts a slice of *AlertmanagerConfig to a map.
func (a AlertmanagerConfigs) ToMap() map[string]*AlertmanagerConfig {
ret := make(map[string]*AlertmanagerConfig)
for i := range a {
ret[fmt.Sprintf("config-%d", i)] = a[i]
}
return ret
}
. It only takes the config position in the array into account. Reordering the configs or changing the entire SD config for an entry may temporarily mess with its ams which isn't desired.

I'd expect a test case for that as well.

I think we'll need to compare the configs as we do in other places. If the AlertmanagerConfig is already known, temporarily use the corresponding alertmanagerSet's ams.

We'll need to agree on certain behaviors:

  • Should we set the existing droppedAms as well?
  • Perhaps we don't want the entire AlertmanagerConfig to be the same. Maybe equal ServiceDiscoveryConfigs (and RelabelConfigs) are sufficient, but it wouldn't be safe.

@roidelapluie
8000 Copy link
Member

I think this is fixed by #14174

@machine424
Copy link
Member

I don't think #14174 was meant to fix this, the TestReApplyConfig added here is still failing on main.

I'll try to revive this with the fix I proposed here #13119 (comment)

@machine424
Copy link
Member

we can close this as #14987 was merged.
Thanks again for looking into this.

@machine424 machine424 closed this Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty activeAlertmanagers list
4 participants
0