notifier: Consider alert relabeling when deciding if alerts are dropped by jrcichra · Pull Request #15979 · prometheus/prometheus
notifier: Consider alert relabeling when deciding if alerts are dropped #15979


Merged
merged 1 commit into prometheus:main on Feb 28, 2025

Conversation

jrcichra
Contributor
@jrcichra jrcichra commented Feb 5, 2025

When intentionally dropping all alerts in a batch, SendAll returns false, increasing the dropped_total metric.

This makes it difficult to tell whether an increase is caused by a connection issue between Prometheus and AlertManager or by alerts that were intentionally dropped.

Have SendAll return true when no batches were sent by keeping track of the number of AlertManager request attempts.

If no attempts were made, then the send is successful.
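For illustration, the rule as originally proposed boils down to something like the following standalone Go sketch (the function name and parameters are made up for this example and are not the PR's code; the approach was later reworked as described below):

package main

import "fmt"

// succeeded mirrors the original proposal, given the number of AlertManager
// requests attempted in a sendAll round and how many of them succeeded.
func succeeded(attempts, successes int) bool {
	if attempts == 0 {
		// Every alert was intentionally dropped by relabeling: nothing to
		// send, so the round counts as successful.
		return true
	}
	return successes > 0
}

func main() {
	fmt.Println(succeeded(0, 0)) // true: everything relabeled away
	fmt.Println(succeeded(3, 0)) // false: genuine delivery failure
}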

Fixes: #15978

The logic was later reworked (see #15979 (comment)), which fixes the above as well.

@machine424
Member

Good catch! Thanks for this.

I agree with you on the issue, but I think a more generic reworking of the logic there is needed.

I find the line "// It returns true if the alerts could be sent successfully to at least one Alertmanager." from the sendAll doc comment ambiguous/misleading.

Because relabeling is involved and each AMSet can have its own relabeling config, "the alerts" in that sentence no longer has a well-defined meaning.

Suppose we start with 100 alerts passed to sendAll. The first AMSet's relabeling keeps only one alert, and sendAll succeeds in sending it (so the global numSuccess is incremented). The second AMSet's relabeling keeps 99 alerts, but sendAll fails to send them to any AM in that set. And yet sendAll returns true, because numSuccess > 0.

I think numSuccess (or equivalent) should be tracked per AMSet and sendAll should return true only if, for each AMSet, at least one AM within the set has received the alerts.

In that case, for an AMSet whose relabel config yields 0 alerts, the send would be considered successful (zero alerts to send = successful send). Similarly, a send for an AMSet without any AM should be considered successful.

In the same vein, if 99 alerts are successfully sent to the first AMSet but only 1 fails for the second, dropped_total is still incremented by 100. It might be necessary to reconsider the dropped metrics, perhaps having one per AMSet or per AM.
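The per-AMSet rule described above can be summarized in a small standalone sketch (illustrative names only, not the PR's code); it reproduces the 100-alert scenario and returns false where the old global numSuccess > 0 check would have returned true:

package main

import "fmt"

// amSetResult is a hypothetical per-set summary: how many requests to AMs in
// the set were attempted after relabeling, and how many of them succeeded.
type amSetResult struct {
	attempts  int
	successes int
}

// batchSucceeded returns true only if every set that attempted a send reached
// at least one of its AMs. Sets with zero attempts (all alerts relabeled away,
// or no AMs configured) are treated as trivially successful.
func batchSucceeded(results []amSetResult) bool {
	for _, r := range results {
		if r.attempts > 0 && r.successes == 0 {
			return false
		}
	}
	return true
}

func main() {
	results := []amSetResult{
		{attempts: 1, successes: 1}, // set 1: keeps 1 alert and delivers it
		{attempts: 2, successes: 0}, // set 2: keeps 99 alerts, reaches none of its 2 AMs
	}
	fmt.Println(batchSucceeded(results)) // false
}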

@jrcichra jrcichra force-pushed the sendall-remap branch 2 times, most recently from 15ecb8f to 84ffd58 on February 6, 2025 at 22:14
@jrcichra
Contributor Author
jrcichra commented Feb 6, 2025

Hey @machine424, that explanation was very helpful. Thank you! I've updated the PR to consider AM sets.

If there are any sets without a success, sendAll returns false. If we never send anything at all, there are no set failures, so it returns true. I fixed up the tests to cover those situations.

@jrcichra
Contributor Author
jrcichra commented Feb 6, 2025

To do this calculation safely per set, my implementation now processes each set serially. I could add another layered waitGroup to process AM sets in parallel.

@machine424
Member

To do this calculation safely per set, my implementation now processes each set serially. I could add another layered waitGroup to process AM sets in parallel.

Yes, I think we should keep running all routines concurrently as before.

@jrcichra
Contributor Author

Ok now the code has a top-level waitGroup that handles each set in parallel. We still need an inner waitGroup to tell if at least one AM in the set got the alert.
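Schematically, the two-level pattern described here looks something like the self-contained sketch below (simulated sends and illustrative names only; it is not the PR's code, and the thread settles on a simpler approach further down):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// Each inner slice is one Alertmanager set; each bool stands in for
	// whether the send to one AM in that set succeeds.
	sets := [][]bool{
		{false, true},  // set 0: one AM fails, one succeeds
		{false, false}, // set 1: every AM fails
	}

	var allSetsCovered atomic.Bool
	allSetsCovered.Store(true)

	var outer sync.WaitGroup // one goroutine per set
	for _, ams := range sets {
		outer.Add(1)
		go func(ams []bool) {
			defer outer.Done()

			var inner sync.WaitGroup // one goroutine per AM in the set
			var setSuccesses atomic.Int64
			for _, ok := range ams {
				inner.Add(1)
				go func(ok bool) {
					defer inner.Done()
					if ok { // stands in for a successful sendOne call
						setSuccesses.Add(1)
					}
				}(ok)
			}
			inner.Wait()

			// A set with AMs but no successful send marks the whole batch as failed.
			if len(ams) > 0 && setSuccesses.Load() == 0 {
				allSetsCovered.Store(false)
			}
		}(ams)
	}
	outer.Wait()

	fmt.Println(allSetsCovered.Load()) // false: set 1 reached no AM
}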

@cbroglie
cbroglie commented Feb 15, 2025

Ok now the code has a top-level waitGroup that handles each set in parallel. We still need an inner waitGroup to tell if at least one AM in the set got the alert.

An alternative could be to leave the WaitGroup logic as-is, but accumulate attempts/successes per AM set, protected with a lock, and then count them at the end:

diff --git a/notifier/notifier.go b/notifier/notifier.go
index fbc37c29e..63e5f12b8 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -36,7 +36,6 @@ import (
 	"github.com/prometheus/common/promslog"
 	"github.com/prometheus/common/version"
 	"github.com/prometheus/sigv4"
-	"go.uber.org/atomic"
 	"gopkg.in/yaml.v2"
 
 	"github.com/prometheus/prometheus/config"
@@ -553,9 +552,11 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
 	var (
 		wg         sync.WaitGroup
-		numSuccess atomic.Uint64
+		lock       sync.Mutex
+		numSuccess = make(map[string]int, len(amSets))
+		numAttempt = make(map[string]int, len(amSets))
 	)
-	for _, ams := range amSets {
+	for k, ams := range amSets {
 		var (
 			payload  []byte
 			err      error
@@ -617,18 +618,25 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 			ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ams.cfg.Timeout))
 			defer cancel()
 
-			go func(ctx context.Context, client *http.Client, url string, payload []byte, count int) {
-				if err := n.sendOne(ctx, client, url, payload); err != nil {
+			go func(ctx context.Context, k string, client *http.Client, url string, payload []byte, count int) {
+				err := n.sendOne(ctx, client, url, payload)
+				if err != nil {
 					n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
 					n.metrics.errors.WithLabelValues(url).Add(float64(count))
-				} else {
-					numSuccess.Inc()
 				}
+
+				lock.Lock()
+				numAttempt[k]++
+				if err == nil {
+					numSuccess[k]++
+				}
+				lock.Unlock()
+
 				n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
 				n.metrics.sent.WithLabelValues(url).Add(float64(count))
 
 				wg.Done()
-			}(ctx, ams.client, am.url().String(), payload, len(amAlerts))
+			}(ctx, k, ams.client, am.url().String(), payload, len(amAlerts))
 		}
 
 		ams.mtx.RUnlock()
@@ -636,7 +644,14 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
 	wg.Wait()
 
-	return numSuccess.Load() > 0
+	// Return false if there are any sets which are attempted (e.g. not filtered
+	// out) but have no successes.
+	for k := range numAttempt {
+		if numAttempt[k] > 0 && numSuccess[k] == 0 {
+			return false
+		}
+	}
+	return true
 }
 
 func alertsToOpenAPIAlerts(alerts []*Alert) models.PostableAlerts {

@jrcichra
Contributor Author

Thanks @cbroglie, that is a cleaner solution compared to two levels of WaitGroups.

@jrcichra
Contributor Author

Hey @machine424, does the change look good now?

Member
@machine424 machine424 left a comment


I think it can also work with a single sync.Map:

diff --git a/notifier/notifier.go b/notifier/notifier.go
index 8cdaf92a0..9b21c4d1c 100644
--- a/notifier/notifier.go
+++ b/notifier/notifier.go
@@ -551,10 +551,9 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
        n.mtx.RUnlock()
 
        var (
-               wg         sync.WaitGroup
-               lock       sync.Mutex
-               numSuccess = make(map[string]int, len(amSets))
-               numAttempt = make(map[string]int, len(amSets))
+               wg sync.WaitGroup
+
+               amSetCovered sync.Map
        )
        for k, ams := range amSets {
                var (
@@ -612,6 +611,8 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
                        cachedPayload = nil
                }
 
+               // Being here means len(ams.ams) > 0
+               amSetCovered.Store(k, false)
                for _, am := range ams.ams {
                        wg.Add(1)
 
@@ -623,15 +624,10 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
                                if err != nil {
                                        n.logger.Error("Error sending alerts", "alertmanager", url, "count", count, "err", err)
                                        n.metrics.errors.WithLabelValues(url).Add(float64(count))
+                               } else {
+                                       amSetCovered.CompareAndSwap(k, false, true)
                                }
 
-                               lock.Lock()
-                               numAttempt[k]++
-                               if err == nil {
-                                       numSuccess[k]++
-                               }
-                               lock.Unlock()
-
                                n.metrics.latency.WithLabelValues(url).Observe(time.Since(begin).Seconds())
                                n.metrics.sent.WithLabelValues(url).Add(float64(count))
 
@@ -646,12 +642,16 @@ func (n *Manager) sendAll(alerts ...*Alert) bool {
 
        // Return false if there are any sets which were attempted (e.g. not filtered
        // out) but have no successes.
-       for k := range numAttempt {
-               if numAttempt[k] > 0 && numSuccess[k] == 0 {
+       allAmSetsCovered := true
+       amSetCovered.Range(func(_, value any) bool {
+               if !value.(bool) {
+                       allAmSetsCovered = false
                        return false
                }
-       }
-       return true
+               return true
+       })
+
+       return allAmSetsCovered
 }
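For reference, the three sync.Map operations used above compose like this standalone sketch (illustrative keys only, not the PR's code; sync.Map.CompareAndSwap requires Go 1.20 or newer): Store seeds each set as uncovered, CompareAndSwap flips the flag on the first success only, and returning false from the Range callback stops the scan at the first uncovered set.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var covered sync.Map

	// Seed every set that has at least one AM as "not yet covered". A set
	// with no AMs (or whose relabeling drops every alert) is simply never
	// stored, so it cannot make the batch fail.
	covered.Store("set-a", false)
	covered.Store("set-b", false)

	// The first successful send flips the flag; later successes for the same
	// set are no-ops because the old value is no longer false.
	covered.CompareAndSwap("set-a", false, true)
	covered.CompareAndSwap("set-a", false, true) // no effect, already true

	// Range stops as soon as the callback returns false, i.e. at the first
	// set that never saw a successful send.
	allCovered := true
	covered.Range(func(_, value any) bool {
		if !value.(bool) {
			allCovered = false
			return false
		}
		return true
	})

	fmt.Println(allCovered) // false: "set-b" was never covered
}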

Thanks for the tests. If you could also make sure the case of a set with no Alertmanager is covered, it'd be great.

@jrcichra
Contributor Author

Thanks, the single covered flag per set does the trick. I added an empty set to the sendAll() test, which validates that covered defaults to true.

@machine424 machine424 changed the title notifier: return true from SendAll when all alerts are dropped notifier: Make SendAll return true iff at least one AM from every AMSet has received the corresponding alerts Feb 26, 2025
Member
@machine424 machine424 left a comment


LGTM, let's see if anyone has concerns.
Thanks @jrcichra for taking care of this.
Thanks @cbroglie for the suggestion.

Member
@SuperQ SuperQ left a comment


LGTM

@SuperQ
Member
SuperQ commented Feb 26, 2025

Can we update the PR title to make it more in line with what it should look like in the changelog?

How about this:

notifier: Don't count alerts towards dropped if they are dropped by alert relabeling

@machine424
Member
machine424 commented Feb 26, 2025

Can we update the PR title to make it more in line with what it should look like in the changelog?

How about this:

notifier: Don't count alerts towards dropped if they are dropped by alert relabeling

Sure, that will make my life easier :)
This will also avoid considering a batch send successful if only one AM set received it.

How about:
notifier: Consider alert relabeling when deciding if alerts are dropped.


I opened #16079 to reconsider what dropped really means, among other things.

When intentionally dropping all alerts in a batch, `SendAll` returns
false, increasing the dropped_total metric.

This makes it difficult to tell if there is a connection issue between
Prometheus and AlertManager or a result of intentionally dropping alerts.

Have `SendAll` return `true` when no batches were sent by keeping track
of the number of AlertManager request attempts.

If no attempts were made, then the send is successful.

Fixes: prometheus#15978

Signed-off-by: Justin Cichra <jcichra@cloudflare.com>
@jrcichra jrcichra changed the title notifier: Make SendAll return true iff at least one AM from every AMSet has received the corresponding alerts notifier: Consider alert relabeling when deciding if alerts are dropped Feb 26, 2025
@jrcichra
Contributor Author

Thanks, I renamed the commit and the PR title.

@machine424 machine424 merged commit 3f0de72 into prometheus:main Feb 28, 2025
27 checks passed
Development

Successfully merging this pull request may close these issues.

prometheus_notifications_dropped_total increases with an alert_relabel_config dropping alerts on all configured AlertManagers