notifier: Consider alert relabeling when deciding if alerts are dropped by jrcichra · Pull Request #15979 · prometheus/prometheus
notifier: Consider alert relabeling when deciding if alerts are dropped #15979


Merged
merged 1 commit into prometheus:main on Feb 28, 2025

Conversation

jrcichra
Contributor
@jrcichra jrcichra commented Feb 5, 2025

When intentionally dropping all alerts in a batch, SendAll returns false, increasing the dropped_total metric.

This makes it difficult to tell whether an increase is caused by a connection issue between Prometheus and AlertManager or by alerts that were intentionally dropped.

Have SendAll return true when no batches were sent by keeping track of the number of AlertManager request attempts.

If no attempts were made, then the send is successful.
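For illustration, the rule as originally proposed boils down to something like the following standalone Go sketch (the function name and parameters are made up for this example and are not the PR's code; the approach was later reworked as described below):

package main

import "fmt"

// succeeded mirrors the original proposal, given the number of AlertManager
// requests attempted in a sendAll round and how many of them succeeded.
func succeeded(attempts, successes int) bool {
	if attempts == 0 {
		// Every alert was intentionally dropped by relabeling: nothing to
		// send, so the round counts as successful.
		return true
	}
	return successes > 0
}

func main() {
	fmt.Println(succeeded(0, 0)) // true: everything relabeled away
	fmt.Println(succeeded(3, 0)) // false: genuine delivery failure
}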

Fixes: #15978

The logic was later reworked (see #15979 (comment)), which fixes the above as well.

@machine424
Member

Good catch! Thanks for this.

I agree with you on the issue, but I think a more generic reworking of the logic there is needed.

I find the line "// It returns true if the alerts could be sent successfully to at least one Alertmanager." from the sendAll doc comment ambiguous/misleading.

Because relabeling is involved and each AMSet can have its own relabeling config, "the alerts" in that sentence no longer has a well-defined meaning.

Suppose we start with 100 alerts passed to sendAll. The first AMSet's relabeling keeps only one alert, and sendAll succeeds in sending it (so the global numSuccess is incremented). The second AMSet's relabeling keeps 99 alerts, but sendAll fails to send them to any AM in that set. And yet sendAll returns true, because numSuccess > 0.

I think numSuccess (or equivalent) should be tracked per AMSet and sendAll should return true only if, for each AMSet, at least one AM within the set has received the alerts.

In that case, for an AMSet whose relabel config yields 0 alerts, the send would be considered successful (zero alerts to send = successful send). Similarly, a send for an AMSet without any AM should be considered successful.

In the same vein, if 99 alerts are successfully sent to the first AMSet but only 1 fails for the second, dropped_total is still incremented by 100. It might be necessary to reconsider the dropped metrics, perhaps having one per AMSet or per AM.
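The per-AMSet rule described above can be summarized in a small standalone sketch (illustrative names only, not the PR's code); it reproduces the 100-alert scenario and returns false where the old global numSuccess > 0 check would have returned true:

package main

import "fmt"

// amSetResult is a hypothetical per-set summary: how many requests to AMs in
// the set were attempted after relabeling, and how many of them succeeded.
type amSetResult struct {
	attempts  int
	successes int
}

// batchSucceeded returns true only if every set that attempted a send reached
// at least one of its AMs. Sets with zero attempts (all alerts relabeled away,
// or no AMs configured) are treated as trivially successful.
func batchSucceeded(results []amSetResult) bool {
	for _, r := range results {
		if r.attempts > 0 && r.successes == 0 {
			return false
		}
	}
	return true
}

func main() {
	results := []amSetResult{
		{attempts: 1, successes: 1}, // set 1: keeps 1 alert and delivers it
		{attempts: 2, successes: 0}, // set 2: keeps 99 alerts, reaches none of its 2 AMs
	}
	fmt.Println(batchSucceeded(results)) // false
}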

@jrcichra jrcichra force-pushed the sendall-remap branch 2 times, most recently from 15ecb8f to 84ffd58 on February 6, 2025 at 22:14
@jrcichra
Contributor Author
jrcichra commented Feb 6, 2025

Hey @machine424, that explanation was very helpful. Thank you! I've updated the PR to consider AM sets.

If there are any sets without a success, sendAll returns false. If we never send anything at all, there are no set failures, so it returns true. I fixed up the tests to cover those situations.

@jrcichra
Contributor Author
jrcichra commented Feb 6, 2025

To do this calculation safely per set, my implementation now processes each set serially. I could add another layered waitGroup to process AM sets in parallel.

@machine424
Member

To do this calculation safely per set, my implementation now processes each set serially. I could add another layered waitGroup to process AM sets in parallel.

Yes, I think we should keep running all routines concurrently as before.

@jrcichra
Contributor Author

Ok now the code has a top-level waitGroup that handles each set in parallel. We still need an inner waitGroup to tell if at least one AM in the set got the alert.
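Schematically, the two-level pattern described here looks something like the self-contained sketch below (simulated sends and illustrative names only; it is not the PR's code, and the thread settles on a simpler approach further down):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// Each inner slice is one Alertmanager set; each bool stands in for
	// whether the send to one AM in that set succeeds.
	sets := [][]bool{
		{false, true},  // set 0: one AM fails, one succeeds
		{false, false}, // set 1: every AM fails
	}

	var allSetsCovered atomic.Bool
	allSetsCovered.Store(true)

	var outer sync.WaitGroup // one goroutine per set
	for _, ams := range sets {
		outer.Add(1)
		go func(ams []bool) {
			defer outer.Done()

			var inner sync.WaitGroup // one goroutine per AM in the set
			var setSuccesses atomic.Int64
			for _, ok := range ams {
				inner.Add(1)
				go func(ok bool) {
					defer inner.Done()
					if ok { // stands in for a successful sendOne call
						setSuccesses.Add(1)
					}
				}(ok)
			}
			inner.Wait()

			// A set with AMs but no successful send marks the whole batch as failed.
			if len(ams) > 0 && setSuccesses.Load() == 0 {
				allSetsCovered.Store(false)
			}
		}(ams)
	}
	outer.Wait()

	fmt.Println(allSetsCovered.Load()) // false: set 1 reached no AM
}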

@cbroglie
cbroglie commented Feb 15, 2025

Ok now the code has a top-level waitGroup that handles each set in parallel. We still need an inner waitGroup to tell if at least one AM in the set got the alert.

An alternative could be to leave the WaitGroup logic as-is, but accumulate attempts/successes per AM set, protected with a lock, and then count them at the end:

diff --git a/notifier/notifier.go b/notifier/notifier.go
index fbc37c29e..63e5f12b8 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -36,7 +36,6 @@ import (
 	"github.com/prometheus/common/promslog"
 	"github.com/prometheus/common/version"
 	"github.com/prometheus/sigv4"
-	"go.uber.org/atomic"
 	"gopkg.in/yaml.v2"
 
 	"github.com/prometheus/prometheus/config"
@@ -553,9 +552,11 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
 	var (
 		wg         sync.WaitGroup
-		numSuccess atomic.Uint64
+		lock       sync.Mutex
+		numSuccess = make(map[string]int, len(amSets))
+		numAttempt = make(map[string]int, len(amSets))
 	)
-	for _, ams := range amSets {
+	for k, ams := range amSets {
 		var (
 			payload  []byte
 			err      error
@@ -617,18 +618,25 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 			ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ams.cfg.Timeout))
 			defer cancel()
 
-			go func(ctx context.Context, client *http.Client, url string, payload []byte, count int) {
-				if err := n.sendOne(ctx, client, url, payload); err != nil {
+			go func(ctx context.Context, k string, client *http.Client, url string, payload []byte, count int) {
+				err := n.sendOne(ctx, client, url, payload)
+				if err != nil {
 					n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
 					n.metrics.errors.WithLabelValues(url).Add(float64(count))
-				} else {
-					numSuccess.Inc()
 				}
+
+				lock.Lock()
+				numAttempt[k]++
+				if err == nil {
+					numSuccess[k]++
+				}
+				lock.Unlock()
+
 				n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
 				n.metrics.sent.WithLabelValues(url).Add(float64(count))
 
 				wg.Done()
-			}(ctx, ams.client, am.url().String(), payload, len(amAlerts))
+			}(ctx, k, ams.client, am.url().String(), payload, len(amAlerts))
 		}
 
 		ams.mtx.RUnlock()
@@ -636,7 +644,14 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
 	wg.Wait()
 
-	return numSuccess.Load() > 0
+	// Return false if there are any sets which are attempted (e.g. not filtered
+	// out) but have no successes.
+	for k := range numAttempt {
+		if numAttempt[k] > 0 && numSuccess[k] == 0 {
+			return false
+		}
+	}
+	return true
 }
 
 func alertsToOpenAPIAlerts(alerts []*Alert) models.PostableAlerts {

@jrcichra
Contributor Author

Thanks @cbroglie, that is a cleaner solution compared to two levels of WaitGroups.

@jrcichra
Contributor Author

Hey @machine424, does the change look good now?

Member
@machine424 machine424 left a comment


I think it can also work with a single sync.Map:

diff --git a/notifier/notifier.go b/notifier/notifier.go
index 8cdaf92a0..9b21c4d1c 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -551,10 +551,9 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
        n.mtx.RUnlock()
 
        var (
-               wg         sync.WaitGroup
-               lock       sync.Mutex
-               numSuccess = make(map[string]int, len(amSets))
-               numAttempt = make(map[string]int, len(amSets))
+               wg sync.WaitGroup
+
+               amSetCovered sync.Map
        )
        for k, ams := range amSets {
                var (
@@ -612,6 +611,8 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
                        cachedPayload = nil
                }
 
+               // Being here means len(ams.ams) > 0
+               amSetCovered.Store(k, false)
                for _, am := range ams.ams {
                        wg.Add(1)
 
@@ -623,15 +624,10 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
                                if err != nil {
                                        n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
                                        n.metrics.errors.WithLabelValues(url).Add(float64(count))
+                               } else {
+                                       amSetCovered.CompareAndSwap(k, false, true)
                                }
 
-                               lock.Lock()
-                               numAttempt[k]++
-                               if err == nil {
-                                       numSuccess[k]++
-                               }
-                               lock.Unlock()
-
                                n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
                                n.metrics.sent.WithLabelValues(url).Add(float64(count))
 
@@ -646,12 +642,16 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
        // Return false if there are any sets which were attempted (e.g. not filtered
        // out) but have no successes.
-       for k := range numAttempt {
-               if numAttempt[k] > 0 && numSuccess[k] == 0 {
+       allAmSetsCovered := true
+       amSetCovered.Range(func(_, value any) bool {
+               if !value.(bool) {
+                       allAmSetsCovered = false
                        return false
                }
-       }
-       return true
+               return true
+       })
+
+       return allAmSetsCovered
 }
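For reference, the three sync.Map operations used above compose like this standalone sketch (illustrative keys only, not the PR's code; sync.Map.CompareAndSwap requires Go 1.20 or newer): Store seeds each set as uncovered, CompareAndSwap flips the flag on the first success only, and returning false from the Range callback stops the scan at the first uncovered set.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var covered sync.Map

	// Seed every set that has at least one AM as "not yet covered". A set
	// with no AMs (or whose relabeling drops every alert) is simply never
	// stored, so it cannot make the batch fail.
	covered.Store("set-a", false)
	covered.Store("set-b", false)

	// The first successful send flips the flag; later successes for the same
	// set are no-ops because the old value is no longer false.
	covered.CompareAndSwap("set-a", false, true)
	covered.CompareAndSwap("set-a", false, true) // no effect, already true

	// Range stops as soon as the callback returns false, i.e. at the first
	// set that never saw a successful send.
	allCovered := true
	covered.Range(func(_, value any) bool {
		if !value.(bool) {
			allCovered = false
			return false
		}
		return true
	})

	fmt.Println(allCovered) // false: "set-b" was never covered
}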

Thanks for the tests. If you could also make sure the case of a set with no Alertmanager is covered, it'd be great.

@jrcichra
Contributor Author

Thanks, the single covered flag per set does the trick. I added an empty set to the sendAll() test, which validates that covered defaults to true.

@machine424 machine424 changed the title notifier: return true from SendAll when all alerts are dropped notifier: Make SendAll return true iff at least one AM from every AMSet has received the corresponding alerts Feb 26, 2025
Member
@machine424 machine424 left a comment


LGTM, let's see if anyone has concerns.
Thanks @jrcichra for taking care of this.
Thanks @cbroglie for the suggestion.

Member
@SuperQ SuperQ left a comment


LGTM

@SuperQ
Member
SuperQ commented Feb 26, 2025

Can we update the PR title to make it more in line with what it should look like in the changelog?

How about this:

notifier: Don't count alerts towards dropped if they are dropped by alert relabeling

@machine424
Member
machine424 commented Feb 26, 2025

Can we update the PR title to make it more in line with what it should look like in the changelog?

How about this:

notifier: Don't count alerts towards dropped if they are dropped by alert relabeling

Sure, that will make my life easier :)
This will also avoid considering a batch send successful if only one AM set received it.

How about:
notifier: Consider alert relabeling when deciding if alerts are dropped.


I opened #16079 to reconsider what dropped really means, among other things.

When intentionally dropping all alerts in a batch, `SendAll` returns
false, increasing the dropped_total metric.

This makes it difficult to tell if there is a connection issue between
Prometheus and AlertManager or a result of intentionally dropping alerts.

Have `SendAll` return `true` when no batches were sent by keeping track
of the number of AlertManager request attempts.

If no attempts were made, then the send is successful.

Fixes: prometheus#15978

Signed-off-by: Justin Cichra <jcichra@cloudflare.com>
@jrcichra jrcichra changed the title notifier: Make SendAll return true iff at least one AM from every AMSet has received the corresponding alerts notifier: Consider alert relabeling when deciding if alerts are dropped Feb 26, 2025
@jrcichra
Contributor Author

Thanks, I renamed the commit and the PR title.

@machine424 machine424 merged commit 3f0de72 into prometheus:main Feb 28, 2025
27 checks passed
Development

Successfully merging this pull request may close these issues.

prometheus_notifications_dropped_total increases with an alert_relabel_config dropping alerts on all configured AlertManagers