fix(notifier): Fix target groups update starvation by machine424 · Pull Request #14174 · prometheus/prometheus · GitHub

fix(notifier): Fix target groups update starvation #14174


Merged
merged 3 commits into prometheus:main from machine424:busy-notifier
Jun 19, 2024

Conversation

machine424
Member
@machine424 machine424 commented May 31, 2024

fixes #13676

chore(notifier): Split Run() into two goroutines: one to receive target group updates and trigger reloads, and the other to send notifications.

This is done to prevent the latter operation from blocking/starving the former: previously, the tsets channel was consumed by the same goroutine that consumes and feeds the buffered n.more channel, so the tsets channel was less likely to be ready, as it is unbuffered and only fed every SDManager.updatert seconds.
See #13676 and #8768.
The synchronization with the sendLoop goroutine is managed through the n.mtx mutex.
This uses a similar approach to the scrape manager's:

// Reloading happens in the background so that it doesn't block receiving targets updates.
func (m *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) error {
go m.reloader()


chore(notifier): add a reproducer for #13676
to show "target groups update" starvation when the notifications queue is full and an Alertmanager is down.

The existing TestHangingNotifier that was added in #10948 doesn't really reflect reality, as the SD changes are manually and continuously fed into syncCh, whereas in reality updates are only resent every updatert.
The test added here sets up an SD manager and links it to the notifier. The SD changes are triggered by that manager, as happens in reality.
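
Roughly, the new reproducer wires a real SD manager to the notifier instead of writing to syncCh by hand (a sketch of the shape only; the exact constructor arguments, options and assertions differ in the actual test):

// Rough shape of the reproducer, not the literal test code.
sdManager := discovery.NewManager(ctx, logger, reg, sdMetrics) // real SD manager, configured with a short updatert
notifierManager := NewManager(&Options{QueueCapacity: 10 /* small, so the queue fills up quickly */}, logger)

go sdManager.Run()                         // re-sends target group updates every updatert
go notifierManager.Run(sdManager.SyncCh()) // consumes them; before the fix this loop could be starved

// Fill the notifications queue while a fake Alertmanager hangs, then change
// the SD configuration and assert that the notifier still picks up the new
// Alertmanager target group within a reasonable time.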


fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
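
Concretely, this means only reading ams while the set's mutex is held, along these lines (a sketch, not the exact diff; amSets stands in for the manager's Alertmanager sets):

for _, ams := range amSets {
	ams.mtx.RLock()
	if len(ams.ams) == 0 {
		// No Alertmanager in this set right now; skip it. Checking this
		// before taking the lock could race with a concurrent reload.
		ams.mtx.RUnlock()
		continue
	}
	// ... build the payload and send it to each Alertmanager in ams.ams,
	// then release the lock (or copy what is needed and unlock earlier).
	ams.mtx.RUnlock()
}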

@machine424 machine424 marked this pull request as draft May 31, 2024 14:54

@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 7f8dd98 to 68640fe Compare May 31, 2024 15:04
@@ -811,3 +815,148 @@ func TestHangingNotifier(t *testing.T) {
})
}
}

// TODO: rename it and even replace TestHangingNotifier with it.
Member Author

WDYT?

Member Author

we can keep both of them, as the current one helped with fixing 5c63ef2

@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 7c6bc2f to ad40823 Compare June 3, 2024 16:23
@machine424 machine424 requested a review from roidelapluie June 3, 2024 16:25
@machine424 machine424 marked this pull request as ready for review June 3, 2024 16:25
@machine424 machine424 marked this pull request as draft June 3, 2024 16:30
@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from 64ac864 to 5c63ef2 Compare June 3, 2024 21:47
@machine424 machine424 marked this pull request as ready for review June 3, 2024 21:47
@machine424 machine424 changed the title from "chore: add a reproducer for https://github.com/prometheus/prometheus/issues/13676" to "chore(notifier): Fix target groups update starvation" Jun 3, 2024
@machine424
Member Author

cc @Spaceman1701

@Spaceman1701
Contributor

Thanks for doing this! The test looks great (and much better than mine in #13677). This solution works, but I'm not convinced it's preferable to a buffered channel - I'll put that comment into #13676 since that's where the discussion is.

@machine424 machine424 marked this pull request as draft June 4, 2024 10:57
@machine424 machine424 force-pushed the busy-notifier branch 2 times, most recently from bfaadab to c3447c2 Compare June 10, 2024 19:39
@machine424 machine424 marked this pull request as ready for review June 11, 2024 08:42
Member
@roidelapluie roidelapluie left a comment

Nice !

@roidelapluie
Member

I think this change is really good

machine424 and others added 3 commits June 19, 2024 09:28
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager
is down.

The existing `TestHangingNotifier` that was added in prometheus#10948 doesn't really reflect reality, as the SD changes are manually and continuously fed into `syncCh`, whereas in reality updates are only resent every `updatert`.

The test added here sets up an SD manager and links it to the notifier. The SD changes will be triggered by that manager as it's done in reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>

Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
…rget updates and trigger reloads and the other one to send notifications.

This is done to prevent the latter operation from blocking/starving the former: previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, so the `tsets` channel was less likely to be ready, as it is unbuffered and only fed every `SDManager.updatert` seconds.

See prometheus#13676 and prometheus#8768

The synchronization with the sendLoop goroutine is managed through the n.mtx mutex.

This uses a similar approach to the scrape manager's: https://github.com/prometheus/prometheus/blob/efbd6e41c59ec8d6b7a0791c1fb337fdac53b4f2/scrape/manager.go#L115-L117

The old TestHangingNotifier was replaced by the new one to more closely reflect reality.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
…et.ams in sendAll

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
@machine424 machine424 enabled auto-merge (rebase) June 19, 2024 07:29
@machine424 machine424 merged commit 70beda0 into prometheus:main Jun 19, 2024
25 checks passed
for {
select {
case <-n.ctx.Done():
return
Contributor

This has changed the behaviour of Run: previously, Run would only return once it had finished both sending notifications and processing target group updates.

However, with this change, Run could return before sending notifications has finished, and there's no way for callers to know when/if sending notifications has finished.

What do you think about waiting for sendLoop() to terminate before we return here?

Member Author

  • sending "the current (if any) batch" of notifications.

In both cases, other notifications could still remain in the queue when Run returns (and that's what #14290 is trying to address).

Also, in both cases, the current batch (if any) will be cancelled, as the request context is a child of the manager's. Previously, it was cleaner: we were sure to catch the request context cancellation and update some metrics (even though that's useless, as Prometheus is already stopping).

There is no test or doc/comments enforcing that old behavior. You can enforce that as part of your PR, or I can take care of it. I think just putting the "notifications sending" loop in the foreground should help with that.
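
One way to restore that guarantee, as suggested above, is to keep the sending loop in the foreground (a sketch only, with the update loop moved into an illustratively named targetUpdateLoop helper; not the actual follow-up change):

func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
	// Receive target group updates in the background...
	go n.targetUpdateLoop(tsets)

	// ...and keep sending notifications in the foreground, so Run only
	// returns once sendLoop has observed n.ctx.Done() and exited.
	n.sendLoop()
}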

@kayengar kayengar left a comment

Thanks!

@bboreham
Member

Why is this labeled "chore" and not "bugfix"?

@machine424 machine424 changed the title from "chore(notifier): Fix target groups update starvation" to "fix(notifier): Fix target groups update starvation" Aug 5, 2024
@machine424
Member Author

Why is this labeled "chore" and not "bugfix"?

I have a bad habit of holding onto "chore" even when the original intention changes.


Successfully merging this pull request may close these issues.

Alertmanager service discovery does not update under heavy load