MemPostings.Delete(): reduce locking/unlocking #13286


Merged

Conversation

colega
Contributor
@colega colega commented Dec 13, 2023

(The description has been updated since I created this PR)

MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the series that have churned.

With the current implementation, it tries to acquire the Lock() for each label/value combination: this can be quite expensive for tenants that have lots of those combinations.

This changes the implementation to call Lock() only once per label name (not per value), and only if that name is affected by the series churn.
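
A minimal sketch of the idea, using a simplified stand-in for tsdb/index.MemPostings (illustrative only, not the PR's actual code):

package sketch

import "sync"

// SeriesRef stands in for storage.SeriesRef.
type SeriesRef uint64

// memPostings is a simplified stand-in for tsdb/index.MemPostings:
// label name -> label value -> postings list, guarded by one RWMutex.
type memPostings struct {
	mtx sync.RWMutex
	m   map[string]map[string][]SeriesRef
}

// Delete removes the deleted series refs, taking the write lock at most once
// per label name, and only for names whose postings are actually affected.
func (p *memPostings) Delete(deleted map[SeriesRef]struct{}) {
	// Snapshot the label names so no iteration over p.m is ever held
	// across an unlock.
	p.mtx.RLock()
	names := make([]string, 0, len(p.m))
	for n := range p.m {
		names = append(names, n)
	}
	p.mtx.RUnlock()

	for _, n := range names {
		// First pass, read lock only: find the values whose postings
		// contain at least one deleted ref.
		p.mtx.RLock()
		var affected []string
		for v, refs := range p.m[n] {
			for _, id := range refs {
				if _, ok := deleted[id]; ok {
					affected = append(affected, v)
					break
				}
			}
		}
		p.mtx.RUnlock()
		if len(affected) == 0 {
			continue // unaffected name: the write lock is never taken
		}

		// Second pass, one write lock per affected name: rebuild the
		// affected postings lists without the deleted refs.
		p.mtx.Lock()
		for _, v := range affected {
			repl := make([]SeriesRef, 0, len(p.m[n][v]))
			for _, id := range p.m[n][v] {
				if _, ok := deleted[id]; !ok {
					repl = append(repl, id)
				}
			}
			if len(repl) > 0 {
				p.m[n][v] = repl
			} else {
				delete(p.m[n], v)
			}
		}
		if len(p.m[n]) == 0 {
			delete(p.m, n)
		}
		p.mtx.Unlock()
	}
}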

I'm also adding a benchmark, which I ran with -benchtime=10x: it's quite difficult to get meaningful results here because each time the operation is called, the benchmarked entity changes.

The benchmark runs some read operations that make the Lock() operation wait, which is the scenario I'm trying to mitigate: in a downstream project we've seen profiles waiting continuously on this Lock() call.
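
The comparison tables below are benchstat output; assuming a baseline run on main saved to main.txt, an invocation along these lines should reproduce them (the benchmark name and -count=6 are inferred from the output, the rest of the flags are assumptions):

go test -run '^$' -bench 'BenchmarkMemPostings_Delete' -benchtime=10x -count=6 ./tsdb/index/ | tee new.txt
benchstat main.txt new.txt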

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb/index
                                            │     main     │                 new                 │
                                            │    sec/op    │    sec/op     vs base               │
MemPostings_Delete/refs=1/readers=0-12         141.2m ± 5%   157.2m ±  1%  +11.29% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=1-12        1152.1m ± 1%   205.7m ±  5%  -82.14% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=10-12        10.424 ± 3%    1.149 ± 45%  -88.98% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=0-12       173.8m ± 5%   202.2m ± 13%  +16.36% (p=0.004 n=6)
MemPostings_Delete/refs=100/readers=1-12      1203.4m ± 1%   226.8m ±  5%  -81.15% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=10-12      10.791 ± 2%    1.164 ± 49%  -89.21% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=0-12     198.3m ± 7%   176.3m ±  2%  -11.08% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=1-12    1246.2m ± 2%   241.1m ±  4%  -80.66% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=10-12    10.571 ± 3%    1.122 ± 49%  -89.39% (p=0.002 n=6)
geomean                                         1.292        357.2m        -72.34%

I haven't found the reason for such a high variance in some of the new implementation's results.

I only ran memory benchmarks with readers=0, and the results are very similar:

                                           │   main_mem    │                new_mem                │
                                           │     B/op      │     B/op       vs base                │
MemPostings_Delete/refs=1/readers=0-12       28.17Mi ± ∞ ¹   27.65Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=100/readers=0-12     31.22Mi ± ∞ ¹   30.70Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=10000/readers=0-12   30.37Mi ± ∞ ¹   29.84Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                      29.89Mi         29.37Mi        -1.74%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                           │   main_mem   │                new_mem                │
                                           │  allocs/op   │  allocs/op    vs base                 │
MemPostings_Delete/refs=1/readers=0-12       36.000 ± ∞ ¹    8.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=100/readers=0-12      248.0 ± ∞ ¹    221.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=10000/readers=0-12   20.18k ± ∞ ¹   20.15k ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                       564.7          329.0        -41.74%

@colega colega changed the title from MemPostings: reduce locking/unlocking to MemPostings.Delete(): reduce locking/unlocking on Jan 2, 2024
@colega colega closed this May 16, 2024
@colega colega reopened this Jun 6, 2024
@colega colega marked this pull request as ready for review June 6, 2024 14:29
@colega colega requested a review from jesusvazquez as a code owner June 6, 2024 14:29
@colega
Contributor Author
colega commented Jun 6, 2024

I've recently seen a head GC taking 3h on a downstream project, and it looked like possible lock contention, so I thought it would be interesting to revive this.

I have no idea how I could benchmark it. It looks like it would only have an impact under query pressure.

repl := make([]storage.SeriesRef, 0, len(p.m[n][l]))

// Only take the write lock when there's actually something to write.
p.mtx.RUnlock()
Member


Unlocking here means another goroutine can jump in and alter p.m, which is unsafe because we are part-way through for n := range p.m on line 297.
I suspect this is why the original code copied the keys and values.

Contributor Author
@colega colega Jun 8, 2024


Thanks, good point. Reverted to copying the keys & values into their slices in b68c9c7.

Also added a proper estimation of the slice sizes to avoid resizing them.

@colega colega force-pushed the reduce-locking-on-mempostings-delete branch from 296ed0e to 7fe831b on June 8, 2024 15:27
colega added 3 commits June 9, 2024 12:06
MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the
series that have churned.

I'd assume that many label values aren't affected by that churn at all,
so it doesn't make sense to touch the lock while checking them.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega colega force-pushed the reduce-locking-on-mempostings-delete branch from 7fe831b to 9da7d70 on June 9, 2024 10:06
colega added 2 commits June 9, 2024 18:13
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega
Contributor Author
colega commented Jun 9, 2024

I've optimized the code further so that it now locks at most once per label name in the storage, pushed a benchmark, and updated the PR description.

@colega
Contributor Author
colega commented Jun 9, 2024

FWIW, I ran the benchmark with 1 iteration removing 100k series with 100 concurrent readers: it took 265s on main and 1.7s on this new implementation.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@jesusvazquez
Member

/prombench main

@prombot
Contributor
prombot commented Jun 10, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-13286 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

Member
@jesusvazquez jesusvazquez left a comment


I've reviewed these changes live with @colega and they make sense to me.

I think the improvements will impact not only compaction times but also writes, so I'm running Prombench to see whether our existing live benchmark captures any of this.

Nice work

Member
@bboreham bboreham left a comment


The code change looks fine; I spent some time trying to explain it better.

I wonder if the whole operation might be more efficient if we started with the series to be deleted, and collected together each name/value pair affected. I guess it depends on the relative size of the Postings index and the set of series to be deleted. But that would be another PR entirely.

  found := false
- for _, id := range p.m[n][l] {
+ for _, id := range p.m[n][vals[i]] {
Member


Note for after this PR: we could copy p.m[n][vals[i]] in the loop earlier (line 313) and save hash lookups.
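
(A hypothetical sketch of that note, not the PR's code or the actual change in 32f5c8d: while collecting the values of one label name, also keep the postings slices, so later loops never need to look up p.m[n][vals[i]] again.)

package sketch

// collectValsAndLists keeps each value's postings slice alongside the value
// itself, saving the repeated two-level map lookups later on. The map shape
// and ref type here are illustrative, not Prometheus code.
func collectValsAndLists(m map[string][]uint64) (vals []string, lists [][]uint64) {
	vals = make([]string, 0, len(m))
	lists = make([][]uint64, 0, len(m))
	for v, refs := range m {
		vals = append(vals, v)
		lists = append(lists, refs)
	}
	return vals, lists
}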

Contributor Author


Since I was pushing the comment changes, I also changed this: 32f5c8d

colega added 2 commits June 10, 2024 13:34
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega
Contributor Author
colega commented Jun 10, 2024

I wonder if the whole operation might be more efficient if we started with the series to be deleted, and collected together each name/value pair affected. I guess it depends on the relative size of the Postings index and the set of series to be deleted. But that would be another PR entirely.

Right, I think we can just hook into the call made to seriesLifecycleCallback:

s.seriesLifecycleCallback.PostDeletion(seriesSet)

It shouldn't be an issue to split this Delete call into multiple calls.
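
(For illustration, a rough sketch of that alternative direction, starting from the deleted series' label sets and collecting the affected name/value pairs up front; the types and names below are assumptions, not Prometheus code.)

package sketch

// labelPair is a stand-in for a label name/value pair.
type labelPair struct{ name, value string }

// affectedPairs collects every name/value pair referenced by the deleted
// series, so a deletion pass would only need to visit those postings lists
// instead of scanning the whole index.
func affectedPairs(deletedSeriesLabels [][]labelPair) map[labelPair]struct{} {
	affected := make(map[labelPair]struct{}, len(deletedSeriesLabels))
	for _, ls := range deletedSeriesLabels {
		for _, l := range ls {
			affected[l] = struct{}{}
		}
	}
	return affected
}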

@jesusvazquez jesusvazquez merged commit 2dc177d into prometheus:main Jun 10, 2024
25 checks passed
@jesusvazquez
Member

/prombench cancel

@prombot
Contributor
prombot commented Jun 10, 2024

Benchmark cancel is in progress.

colega added a commit to grafana/mimir that referenced this pull request Jun 10, 2024
* Update mimir-prometheus to incl MemPostings improvements

Updating to bring changes from:
- prometheus/prometheus#13286
- prometheus/prometheus#14286

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Update CHANGELOG.md

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

---------

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega colega deleted the reduce-locking-on-mempostings-delete branch June 17, 2024 11:39
colega added a commit to colega/prometheus that referenced this pull request Oct 29, 2024
This introduces back some unlocking that was removed in prometheus#13286 but in a
more balanced way, as suggested by @pracucci.

For TSDBs with a lot of churn, Delete() can take a couple of seconds,
and while it's holding the mutex, reads and writes are blocked waiting
for that mutex, increasing the number of connections handled and memory
usage.

This implementation pauses every 4K labels processed (note that also
compared to prometheus#13286 we're not processing all the label-values anymore,
but only the affected ones, because of prometheus#14307), makes sure that it's
possible to get the read lock, and waits for a few milliseconds more.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
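
(A rough sketch of the pausing idea described in this commit message; the batch size, sleep duration, and function shape are illustrative assumptions, not the follow-up PR's actual code.)

package sketch

import (
	"sync"
	"time"
)

// deleteWithPauses processes the affected postings under the write lock, but
// every few thousand entries it briefly releases the lock so blocked readers
// and writers can make progress.
func deleteWithPauses(mtx *sync.RWMutex, affected []string, rewrite func(string)) {
	const pauseEvery = 4096 // "pauses every 4K labels processed"

	mtx.Lock()
	for i, key := range affected {
		rewrite(key) // hypothetical callback that drops the deleted refs for this key
		if (i+1)%pauseEvery == 0 {
			mtx.Unlock()
			time.Sleep(time.Millisecond) // give readers a chance to take the lock
			mtx.Lock()
		}
	}
	mtx.Unlock()
}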
jesusvazquez pushed a commit that referenced this pull request Nov 5, 2024
…15242)
juliusv pushed a commit that referenced this pull request Nov 5, 2024
…15242)
verejoel pushed a commit to verejoel/prometheus that referenced this pull request Nov 27, 2024
…rometheus#15242)
verejoel pushed a commit to open-ch/prometheus that referenced this pull request Dec 17, 2024
…rometheus#15242)
yuchen-db pushed a commit to yuchen-db/prometheus that referenced this pull request Apr 30, 2025
…rometheus#15242)