fix(notifier): Fix target groups update starvation by machine424 · Pull Request #14174 · prometheus/prometheus · GitHub

fix(notifier): Fix target groups update starvation #14174


Merged
merged 3 commits into prometheus:main from machine424:busy-notifier
Jun 19, 2024

Conversation

machine424
Member
@machine424 machine424 commented May 31, 2024

fixes #13676

chore(notifier): Split Run() into two goroutines: one to receive target group updates and trigger reloads, and the other to send notifications.

This is done to prevent the latter operation from blocking/starving the former: previously, the tsets channel was consumed by the same goroutine that consumes and feeds the buffered n.more channel, so the tsets channel was less likely to be ready, as it is unbuffered and only fed every SDManager.updatert seconds.
See #13676 and #8768.
The synchronization with the sendLoop goroutine is managed through the n.mtx mutex.
This uses a similar approach to the scrape manager's:

// Reloading happens in the background so that it doesn't block receiving targets updates.
func (m *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) error {
go m.reloader()


chore(notifier): add a reproducer for #13676
to show "target groups update" starvation when the notifications queue is full and an Alertmanager is down.

The existing TestHangingNotifier that was added in #10948 doesn't really reflect reality, as the SD changes are manually and continuously fed into syncCh, whereas in reality updates are only resent every updatert.
The test added here sets up an SD manager and links it to the notifier. The SD changes are triggered by that manager, as happens in reality.
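
Roughly, the new reproducer wires a real SD manager to the notifier instead of writing to syncCh by hand (a sketch of the shape only; the exact constructor arguments, options and assertions differ in the actual test):

// Rough shape of the reproducer, not the literal test code.
sdManager := discovery.NewManager(ctx, logger, reg, sdMetrics) // real SD manager, configured with a short updatert
notifierManager := NewManager(&Options{QueueCapacity: 10 /* small, so the queue fills up quickly */}, logger)

go sdManager.Run()                         // re-sends target group updates every updatert
go notifierManager.Run(sdManager.SyncCh()) // consumes them; before the fix this loop could be starved

// Fill the notifications queue while a fake Alertmanager hangs, then change
// the SD configuration and assert that the notifier still picks up the new
// Alertmanager target group within a reasonable time.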


fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
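
Concretely, this means only reading ams while the set's mutex is held, along these lines (a sketch, not the exact diff; amSets stands in for the manager's Alertmanager sets):

for _, ams := range amSets {
	ams.mtx.RLock()
	if len(ams.ams) == 0 {
		// No Alertmanager in this set right now; skip it. Checking this
		// before taking the lock could race with a concurrent reload.
		ams.mtx.RUnlock()
		continue
	}
	// ... build the payload and send it to each Alertmanager in ams.ams,
	// then release the lock (or copy what is needed and unlock earlier).
	ams.mtx.RUnlock()
}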

@machine424 machine424 marked this pull request as draft May 31, 2024 14:54

@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 7f8dd98 to 68640fe Compare May 31, 2024 15:04
@@ -811,3 +815,148 @@ func TestHangingNotifier(t *testing.T) {
})
}
}

// TODO: rename it and even replace TestHangingNotifier with it.
Member Author

WDYT?

Member Author

we can keep both of them, as the current one helped with fixing 5c63ef2

@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 7c6bc2f to ad40823 Compare June 3, 2024 16:23
@machine424 machine424 requested a review from roidelapluie June 3, 2024 16:25
@machine424 machine424 marked this pull request as ready for review June 3, 2024 16:25
@machine424 machine424 marked this pull request as draft June 3, 2024 16:30
@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 64ac864 to 5c63ef2 Compare June 3, 2024 21:47
@machine424 machine424 marked this pull request as ready for review June 3, 2024 21:47
@machine424 machine424 changed the title from "chore: add a reproducer for https://github.com/prometheus/prometheus/issues/13676" to "chore(notifier): Fix target groups update starvation" Jun 3, 2024
@machine424
Member Author

cc @Spaceman1701

@Spaceman1701
Contributor

Thanks for doing this! The test looks great (and much better than mine in #13677). This solution works, but I'm not convinced it's preferable to a buffered channel - I'll put that comment into #13676 since that's where the discussion is.

@machine424 machine424 marked this pull request as draft June 4, 2024 10:57
@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from bfaadab to c3447c2 Compare June 10, 2024 19:39
@machine424 machine424 marked this pull request as ready for review June 11, 2024 08:42
Member
@roidelapluie roidelapluie left a comment

Nice !

@roidelapluie
Member

I think this change is really good

machine424 and others added 3 commits June 19, 2024 09:28
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager
is down.

The existing `TestHangingNotifier` that was added in prometheus#10948 doesn't really reflect reality, as the SD changes are manually and continuously fed into `syncCh`, whereas in reality updates are only resent every `updatert`.

The test added here sets up an SD manager and links it to the notifier. The SD changes will be triggered by that manager as it's done in reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
…rget updates and trigger reloads and the other one to send notifications.

This is done to prevent the latter operation from blocking/starving the former: previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, so the `tsets` channel was less likely to be ready, as it is unbuffered and only fed every `SDManager.updatert` seconds.

See prometheus#13676 and prometheus#8768

The synchronization with the sendLoop goroutine is managed through the n.mtx mutex.

This uses a similar approach to the scrape manager's: https://github.com/prometheus/prometheus/blob/efbd6e41c59ec8d6b7a0791c1fb337fdac53b4f2/scrape/manager.go#L115-L117

The old TestHangingNotifier was replaced by the new one to more closely reflect reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
…et.ams in sendAll

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
@machine424 machine424 enabled auto-merge (rebase) June 19, 2024 07:29
@machine424 machine424 merged commit 70beda0 into prometheus:main Jun 19, 2024
25 checks passed
for {
select {
case <-n.ctx.Done():
return
Contributor

This has changed the behaviour of Run: previously, Run would only return once it had finished both sending notifications and processing target group updates.

However, with this change, Run could return before sending notifications has finished, and there's no way for callers to know when/if sending notifications has finished.

What do you think about waiting for sendLoop() to terminate before we return here?

Member Author

  • sending "the current (if any) batch" of notifications.

In both cases, other notifications could still remain in the queue when Run returns (and that's what #14290 is trying to address).

Also, in both cases, the current batch (if any) will be cancelled, as the request context is a child of the manager's. Previously, it was cleaner: we were sure to catch the request context cancellation and update some metrics (even though that's useless, as Prometheus is already stopping).

There is no test or doc/comments enforcing that old behavior. You can enforce that as part of your PR, or I can take care of it. I think just putting the "notifications sending" loop in the foreground should help with that.
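
One way to restore that guarantee, as suggested above, is to keep the sending loop in the foreground (a sketch only, with the update loop moved into an illustratively named targetUpdateLoop helper; not the actual follow-up change):

func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
	// Receive target group updates in the background...
	go n.targetUpdateLoop(tsets)

	// ...and keep sending notifications in the foreground, so Run only
	// returns once sendLoop has observed n.ctx.Done() and exited.
	n.sendLoop()
}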

@kayengar kayengar left a comment

Thanks!

@bboreham
Member

Why is this labeled "chore" and not "bugfix"?

@machine424 machine424 changed the title from "chore(notifier): Fix target groups update starvation" to "fix(notifier): Fix target groups update starvation" Aug 5, 2024
@machine424
Member Author

Why is this labeled "chore" and not "bugfix"?

I have a bad habit of holding onto "chore" even when the original intention changes.


Successfully merging this pull request may close these issues.

Alertmanager service discovery does not update under heavy load