fix(notifier): Fix target groups update starvation #14174
Conversation
The test is failing here: https://github.com/prometheus/prometheus/actions/runs/9320102584/job/25656165868?pr=14174#step:6:234
Force-pushed from 7f8dd98 to 68640fe.
notifier/notifier_test.go (outdated)
@@ -811,3 +815,148 @@ func TestHangingNotifier(t *testing.T) {
		})
	}
}
// TODO: rename it and even replace TestHangingNotifier with it.
WDYT?
We can keep both of them, as the current one helped fix 5c63ef2.
Force-pushed from 7c6bc2f to ad40823.
Force-pushed from 64ac864 to 5c63ef2.
Force-pushed from bfaadab to c3447c2.
Nice! I think this change is really good.
chore(notifier): add a reproducer for #13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager is down. The existing `TestHangingNotifier` that was added in prometheus#10948 doesn't really reflect reality, as the SD changes are manually fed into `syncCh` continuously, whereas in reality updates are only resent every `updatert`. The test added here sets up an SD manager and links it to the notifier; the SD changes are triggered by that manager, as happens in reality.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Ethan Hunter <ehunter@hudson-trading.com>
chore(notifier): Split 'Run()' into two goroutines: one to receive target updates and trigger reloads, and the other to send notifications.
This is done to prevent the latter operation from blocking/starving the former: previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, so the `tsets` channel was less likely to be ready, as it's unbuffered and only fed every `SDManager.updatert` seconds. See prometheus#13676 and prometheus#8768.
The synchronization with the sendLoop goroutine is managed through the `n.mtx` mutex. This uses a similar approach to the scrape manager's: https://github.com/prometheus/prometheus/blob/efbd6e41c59ec8d6b7a0791c1fb337fdac53b4f2/scrape/manager.go#L115-L117
The old TestHangingNotifier was replaced by the new one to more closely reflect reality.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Force-pushed from c3447c2 to 8e5a99e.
	for {
		select {
		case <-n.ctx.Done():
			return
This has changed the behaviour of `Run`: previously, `Run` would only return once it had finished both sending notifications and processing target group updates. However, with this change, `Run` could return before sending notifications has finished, and there's no way for callers to know when/if sending notifications has finished.
What do you think about waiting for `sendLoop()` to terminate before we return here?
sending "the current (if any) batch" of notifications
.
In both cases, other notifications could still remain in the queue when Run
returns (and that's what #14290 is trying to address).
Also, in both cases, the current batch (if any) will be cancelled: the request context is a child of the manager's. Previously, it was cleaner: we were sure to catch the request context cancellation and updated some metrics (even though it's useless as Prometheus is already stopping).
There is no test or doc/comments enforcing that old behavior. You can enforce that as part of your PR, or I can take care of it. I think just putting the "notifications sending" loop in the foreground should help with that.
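For illustration, here is a minimal, self-contained sketch of the behaviour being discussed (toy functions and names such as `sendLoop`/`targetUpdateLoop`, not the notifier's actual code): `Run` starts the send loop in a goroutine but only returns once that goroutine has exited, which has the same effect as keeping the send loop in the foreground.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Toy stand-ins for the two loops discussed above; the real methods live on
// notifier.Manager, these exist only so the sketch compiles and runs.
func sendLoop(ctx context.Context) {
	<-ctx.Done() // stand-in for sending batches until shutdown
	fmt.Println("send loop done")
}

func targetUpdateLoop(ctx context.Context, tsets <-chan string) {
	for {
		select {
		case <-ctx.Done():
			return
		case ts := <-tsets:
			fmt.Println("reload for", ts)
		}
	}
}

// Run only returns once the sending goroutine has exited, so callers know
// that sending has stopped by the time Run comes back.
func Run(ctx context.Context, tsets <-chan string) {
	sendDone := make(chan struct{})
	go func() {
		defer close(sendDone)
		sendLoop(ctx)
	}()

	targetUpdateLoop(ctx, tsets)
	<-sendDone
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	Run(ctx, make(chan string))
	fmt.Println("Run returned: both loops have stopped")
}
```

Either variant (waiting on a done channel or running the send loop in the foreground) gives callers the guarantee that sending has stopped when `Run` returns.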
Thanks!
Why is this labeled "chore" and not "bugfix"?
I have a bad habit of holding onto "chore" even when the original intention changes.
fixes #13676
chore(notifier): Split 'Run()' into two goroutines: one to receive target updates and trigger reloads, and the other to send notifications.
This is done to prevent the latter operation from blocking/starving the former: previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, so the `tsets` channel was less likely to be ready, as it's unbuffered and only fed every `SDManager.updatert` seconds. See #13676 and #8768.
The synchronization with the sendLoop goroutine is managed through the `n.mtx` mutex. This uses a similar approach to the scrape manager's (prometheus/scrape/manager.go, lines 115 to 117 at efbd6e4).
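As a rough illustration of the description above, here is a hedged, self-contained sketch (toy types and fields such as `manager`, `ams` and `more`, not the PR's actual code) of the two-goroutine split with a mutex guarding the shared state, in the spirit of the scrape manager pattern linked above:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// manager is a toy stand-in for notifier.Manager: reloads and sends run in
// separate goroutines, and a mutex guards the state both of them touch.
type manager struct {
	mtx  sync.Mutex
	ams  []string      // "Alertmanager" endpoints, rebuilt on reload
	more chan struct{} // buffered signal that the queue has data
}

// targetUpdateLoop consumes target-group updates; a slow or blocked send
// path can no longer starve it, since it runs in its own goroutine.
func (m *manager) targetUpdateLoop(ctx context.Context, tsets <-chan []string) {
	for {
		select {
		case <-ctx.Done():
			return
		case ts := <-tsets:
			m.mtx.Lock()
			m.ams = ts
			m.mtx.Unlock()
			fmt.Println("reloaded", ts)
		}
	}
}

// sendLoop drains the queue signal and reads the shared state under the mutex.
func (m *manager) sendLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-m.more:
			m.mtx.Lock()
			targets := append([]string(nil), m.ams...)
			m.mtx.Unlock()
			fmt.Println("sending batch to", targets)
			time.Sleep(20 * time.Millisecond) // a slow send no longer delays reloads
		}
	}
}

func (m *manager) Run(ctx context.Context, tsets <-chan []string) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); m.targetUpdateLoop(ctx, tsets) }()
	go func() { defer wg.Done(); m.sendLoop(ctx) }()
	wg.Wait()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	m := &manager{more: make(chan struct{}, 1)}
	tsets := make(chan []string, 1)
	tsets <- []string{"am-1:9093"}
	m.more <- struct{}{}
	m.Run(ctx, tsets)
}
```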
chore(notifier): add a reproducer for #13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager is down.
The existing `TestHangingNotifier` that was added in #10948 doesn't really reflect reality, as the SD changes are manually fed into `syncCh` continuously, whereas in reality updates are only resent every `updatert`.
The test added here sets up an SD manager and links it to the notifier. The SD changes are triggered by that manager, as happens in reality.
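A hedged sketch of what such a reproducer can look like (standalone toy code in a `_test.go` file, hypothetical test name, not the PR's actual test): a fake SD manager only offers the current target groups once per `updatert` tick over an unbuffered channel, retrying on the next tick if the consumer is busy, rather than feeding updates continuously the way the old test did.

```go
package repro

import (
	"testing"
	"time"
)

// TestTargetUpdatesNotStarved is a hypothetical, simplified reproducer shape:
// the fake SD manager below only (re)sends updates every updatert over an
// unbuffered channel, so a notifier whose send path is busy must still be
// free at that moment to pick the update up.
func TestTargetUpdatesNotStarved(t *testing.T) {
	const updatert = 100 * time.Millisecond

	syncCh := make(chan string) // unbuffered, like the SD manager's sync channel
	stop := make(chan struct{})
	defer close(stop)

	// Fake SD manager: offer the update once per tick, retry if nobody is ready.
	go func() {
		ticker := time.NewTicker(updatert)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				select {
				case syncCh <- "targetgroup-update":
				default: // consumer busy (e.g. starved by a full send queue); retry next tick
				}
			}
		}
	}()

	// The notifier side must still pick the update up while sending is slow.
	select {
	case <-syncCh:
		// Update received: no starvation.
	case <-time.After(10 * updatert):
		t.Fatal("target-group update was starved")
	}
}
```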
fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
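For illustration, a minimal sketch of the locking order that last commit title describes, using hypothetical simplified types (the real `alertmanagerSet` holds richer state, and copying the slice before unlocking is just one possible way to keep the sends outside the lock): the emptiness check on `ams` happens only after the set's mutex is taken.

```go
package main

import (
	"fmt"
	"sync"
)

// alertmanagerSet is a toy stand-in: just a mutex plus the list of endpoints.
type alertmanagerSet struct {
	mtx sync.RWMutex
	ams []string
}

// sendAll takes the mutex before checking ams, then works on a snapshot so
// the (potentially slow) sends happen without holding the lock.
func (s *alertmanagerSet) sendAll(alerts []string) {
	s.mtx.RLock()
	if len(s.ams) == 0 { // the check is done under the lock, not before it
		s.mtx.RUnlock()
		return
	}
	ams := append([]string(nil), s.ams...)
	s.mtx.RUnlock()

	for _, am := range ams {
		fmt.Printf("would POST %d alerts to %s\n", len(alerts), am)
	}
}

func main() {
	s := &alertmanagerSet{ams: []string{"am-1:9093"}}
	s.sendAll([]string{"HighLatency"})
}
```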