notifier: Consider alert relabeling when deciding if alerts are dropped #15979
Conversation
Good catch! Thanks for this. I agree with you on the issue, but I think a more generic reworking of the logic there is needed. Because relabeling is involved and each AMSet can have its own relabeling config, success should be judged per AMSet. Suppose we start with 100 alerts passed to `sendAll`: for an AMSet whose relabel config yields 0 alerts, the send should be considered successful (zero alerts to send = successful send). Similarly, a send for an AMSet without any AM should be considered successful. In the same vein, if 99 alerts are successfully sent to the first AMSet but only 1 fails for the second, that failure should not be masked by the first set's success.
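A compact way to restate that per-set rule (purely illustrative; `setOK` and its parameters are made-up names, not notifier code):

```go
// setOK restates the per-set rule sketched above: a set whose relabeling left
// nothing to send, or which has no Alertmanager at all, is trivially
// successful; otherwise at least one request must get through.
func setOK(alertsToSend, alertmanagers, successfulSends int) bool {
	if alertsToSend == 0 || alertmanagers == 0 {
		return true
	}
	return successfulSends > 0
}
```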
Force-pushed from 15ecb8f to 84ffd58.
Hey @machine424, that explanation was very helpful. Thank you! I've updated the PR to consider AM sets: if there are any sets without a success, `sendAll` returns `false`.
To do this calculation safely per set, my implementation now processes each set serially. I could add another, layered waitGroup to process AM sets in parallel.
Force-pushed from 84ffd58 to 08368ce.
Yes, I think we should keep running all routines concurrently as before.
Force-pushed from 08368ce to bc63255.
OK, now the code has a top-level waitGroup that handles each set in parallel. We still need an inner waitGroup to tell whether at least one AM in the set got the alert.
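For readers following along, here is a self-contained sketch of that two-level fan-out, assuming made-up types and helpers (`amSet`, `sendOne`, `sendAllSketch`); it is not the PR's code, just the shape being described:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Illustrative stand-ins for the notifier's types; names are invented.
type alertmanager struct{ url string }

type amSet struct {
	name string
	ams  []alertmanager
}

// sendOne stands in for the real HTTP POST to one Alertmanager.
func sendOne(am alertmanager) error { return nil }

// sendAllSketch fans out with an outer WaitGroup per set and an inner
// WaitGroup per Alertmanager in that set, then fails if any non-empty set
// had no successful send.
func sendAllSketch(sets []amSet) bool {
	var (
		outer      sync.WaitGroup
		failedSets atomic.Int64
	)
	for _, set := range sets {
		if len(set.ams) == 0 {
			continue // nothing to attempt for this set counts as success
		}
		outer.Add(1)
		go func(set amSet) {
			defer outer.Done()
			var (
				inner   sync.WaitGroup
				success atomic.Bool
			)
			for _, am := range set.ams {
				inner.Add(1)
				go func(am alertmanager) {
					defer inner.Done()
					if err := sendOne(am); err == nil {
						success.Store(true)
					}
				}(am)
			}
			inner.Wait()
			if !success.Load() {
				failedSets.Add(1)
			}
		}(set)
	}
	outer.Wait()
	return failedSets.Load() == 0
}

func main() {
	sets := []amSet{
		{name: "a", ams: []alertmanager{{url: "http://am-0:9093"}}},
		{name: "b"}, // a set with no Alertmanager
	}
	fmt.Println(sendAllSketch(sets)) // true
}
```

The outer WaitGroup keeps the sets concurrent, while each inner WaitGroup lets a set compute its own pass/fail before the aggregate decision.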
An alternative could be to leave the waitgroup logic as-is, but accumulate attempts/successes per AM set protected with a lock, and then count them at the end:

```diff
diff --git a/notifier/notifier.go b/notifier/notifier.go
index fbc37c29e..63e5f12b8 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -36,7 +36,6 @@ import (
"github.com/prometheus/common/promslog"
"github.com/prometheus/common/version"
"github.com/prometheus/sigv4"
- "go.uber.org/atomic"
"gopkg.in/yaml.v2"
"github.com/prometheus/prometheus/config"
@@ -553,9 +552,11 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
var (
wg sync.WaitGroup
- numSuccess atomic.Uint64
+ lock sync.Mutex
+ numSuccess = make(map[string]int, len(amSets))
+ numAttempt = make(map[string]int, len(amSets))
)
- for _, ams := range amSets {
+ for k, ams := range amSets {
var (
payload []byte
err error
@@ -617,18 +618,25 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ams.cfg.Timeout))
defer cancel()
- go func(ctx context.Context, client *http.Client, url string, payload []byte, count int) {
- if err := n.sendOne(ctx, client, url, payload); err != nil {
+ go func(ctx context.Context, k string, client *http.Client, url string, payload []byte, count int) {
+ err := n.sendOne(ctx, client, url, payload)
+ if err != nil {
n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
n.metrics.errors.WithLabelValues(url).Add(float64(count))
- } else {
- numSuccess.Inc()
}
+
+ lock.Lock()
+ numAttempt[k]++
+ if err == nil {
+ numSuccess[k]++
+ }
+ lock.Unlock()
+
n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
n.metrics.sent.WithLabelValues(url).Add(float64(count))
wg.Done()
- }(ctx, ams.client, am.url().String(), payload, len(amAlerts))
+ }(ctx, k, ams.client, am.url().String(), payload, len(amAlerts))
}
ams.mtx.RUnlock()
@@ -636,7 +644,14 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
wg.Wait()
- return numSuccess.Load() > 0
+ // Return false if there are any sets which are attempted (e.g. not filtered
+ // out) but have no successes.
+ for k := range numAttempt {
+ if numAttempt[k] > 0 && numSuccess[k] == 0 {
+ return false
+ }
+ }
+ return true
}
func alertsToOpenAPIAlerts(alerts []*Alert) models.PostableAlerts {
```
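A tiny standalone illustration of that final check, with made-up counts (not taken from the PR):

```go
package main

import "fmt"

func main() {
	// Invented numbers for two sets: set-a fully succeeded, set-b was
	// attempted three times with no success.
	numAttempt := map[string]int{"set-a": 2, "set-b": 3}
	numSuccess := map[string]int{"set-a": 2, "set-b": 0}

	ok := true
	for k := range numAttempt {
		if numAttempt[k] > 0 && numSuccess[k] == 0 {
			ok = false // set-b was attempted but nothing got through
		}
	}
	fmt.Println(ok) // false
}
```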
Force-pushed from bc63255 to baded7a.
Thanks @cbroglie, that is a cleaner solution compared to two levels of waitgroups.
Hey @machine424, does the change look good now?
I think it can also work with one `sync.Map`:
```diff
diff --git a/notifier/notifier.go b/notifier/notifier.go
index 8cdaf92a0..9b21c4d1c 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -551,10 +551,9 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
n.mtx.RUnlock()
var (
- wg sync.WaitGroup
- lock sync.Mutex
- numSuccess = make(map[string]int, len(amSets))
- numAttempt = make(map[string]int, len(amSets))
+ wg sync.WaitGroup
+
+ amSetCovered sync.Map
)
for k, ams := range amSets {
var (
@@ -612,6 +611,8 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
cachedPayload = nil
}
+ // Being here means len(ams.ams) > 0
+ amSetCovered.Store(k, false)
for _, am := range ams.ams {
wg.Add(1)
@@ -623,15 +624,10 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
if err != nil {
n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
n.metrics.errors.WithLabelValues(url).Add(float64(count))
+ } else {
+ amSetCovered.CompareAndSwap(k, false, true)
}
- lock.Lock()
- numAttempt[k]++
- if err == nil {
- numSuccess[k]++
- }
- lock.Unlock()
-
n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
n.metrics.sent.WithLabelValues(url).Add(float64(count))
@@ -646,12 +642,16 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
// Return false if there are any sets which were attempted (e.g. not filtered
// out) but have no successes.
- for k := range numAttempt {
- if numAttempt[k] > 0 && numSuccess[k] == 0 {
+ allAmSetsCovered := true
+ amSetCovered.Range(func(_, value any) bool {
+ if !value.(bool) {
+ allAmSetsCovered = false
return false
}
- }
- return true
+ return true
+ })
+
+ return allAmSetsCovered
}
```
Thanks for the tests. If you could make sure the case "a set with no alertmanager" is considered as well, it'd be great.
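As a standalone illustration of the `sync.Map` pattern suggested above (it relies on `sync.Map.CompareAndSwap`, available since Go 1.20; the set names and errors are invented):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Invented per-set send outcomes: set-a has one success out of two sends,
	// set-b fails its only send.
	results := map[string][]error{
		"set-a": {nil, fmt.Errorf("timeout")},
		"set-b": {fmt.Errorf("connection refused")},
	}

	var (
		wg      sync.WaitGroup
		covered sync.Map // set name -> at least one successful send
	)
	for set, errs := range results {
		covered.Store(set, false) // the set is attempted, not yet covered
		for _, err := range errs {
			wg.Add(1)
			go func(set string, err error) {
				defer wg.Done()
				if err == nil {
					// The first success flips the flag; later successes are no-ops.
					covered.CompareAndSwap(set, false, true)
				}
			}(set, err)
		}
	}
	wg.Wait()

	allCovered := true
	covered.Range(func(_, v any) bool {
		if !v.(bool) {
			allCovered = false
			return false // stop iterating
		}
		return true
	})
	fmt.Println(allCovered) // false: set-b was attempted but nothing got through
}
```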
Force-pushed from baded7a to 4e348a3.
Thanks, a single covered flag per set does the trick. I added an empty set to the test.
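A hypothetical shape for such a test case, reduced to the decision rule only (the real test exercises `sendAll` itself and will differ):

```go
package notifier_test

import "testing"

// allSetsCovered mirrors the final check in the suggestion above: every set
// that stored a flag must have seen at least one successful send.
func allSetsCovered(covered map[string]bool) bool {
	for _, ok := range covered {
		if !ok {
			return false
		}
	}
	return true
}

func TestEmptySetDoesNotCauseFailure(t *testing.T) {
	// A set with no Alertmanager never stores a flag, so it cannot flip the
	// overall result to false; only sets that were actually attempted count.
	covered := map[string]bool{"with-ams": true}
	if !allSetsCovered(covered) {
		t.Fatal("a set with no alertmanager must not be reported as a failed send")
	}
}
```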
LGTM
Can we update the PR title to make it more in line with what it should look like in the changelog? How about this:
Sure, that will make my life easier :) How about the text below? I opened #16079 to reconsider what really …
When intentionally dropping all alerts in a batch, `SendAll` returns false, increasing the dropped_total metric. This makes it difficult to tell whether the increase reflects a connection issue between Prometheus and AlertManager or alerts that were intentionally dropped.

Have `SendAll` return `true` when no batches were sent, by keeping track of the number of AlertManager request attempts. If no attempts were made, then the send is successful.

Fixes: prometheus#15978
Signed-off-by: Justin Cichra <jcichra@cloudflare.com>
Force-pushed from 4e348a3 to 78599d0.
Thanks, I renamed the commit and the PR title.
When intentionally dropping all alerts in a batch, `SendAll` returns false, increasing the dropped_total metric. This makes it difficult to tell whether the increase reflects a connection issue between Prometheus and AlertManager or alerts that were intentionally dropped.

Have `SendAll` return `true` when no batches were sent, by keeping track of the number of AlertManager request attempts. If no attempts were made, then the send is successful.

Fixes: #15978

The logic was reworked, see #15979 (comment), which fixes the above as well.