MemPostings.Delete(): reduce locking/unlocking #13286


Merged

Conversation

colega
Contributor
@colega colega commented Dec 13, 2023

(The description has been updated since I created this PR)

MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the series that have churned.

With the current implementation, it tries to acquire the Lock() for each label/value combination: this can be quite expensive for tenants that have lots of those combinations.

This changes the implementation to call Lock() only once per label name (not per value), and only if that name is affected by the series churn.
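
A minimal sketch of the idea, using a simplified stand-in for tsdb/index.MemPostings (illustrative only, not the PR's actual code):

package sketch

import "sync"

// SeriesRef stands in for storage.SeriesRef.
type SeriesRef uint64

// memPostings is a simplified stand-in for tsdb/index.MemPostings:
// label name -> label value -> postings list, guarded by one RWMutex.
type memPostings struct {
	mtx sync.RWMutex
	m   map[string]map[string][]SeriesRef
}

// Delete removes the deleted series refs, taking the write lock at most once
// per label name, and only for names whose postings are actually affected.
func (p *memPostings) Delete(deleted map[SeriesRef]struct{}) {
	// Snapshot the label names so no iteration over p.m is ever held
	// across an unlock.
	p.mtx.RLock()
	names := make([]string, 0, len(p.m))
	for n := range p.m {
		names = append(names, n)
	}
	p.mtx.RUnlock()

	for _, n := range names {
		// First pass, read lock only: find the values whose postings
		// contain at least one deleted ref.
		p.mtx.RLock()
		var affected []string
		for v, refs := range p.m[n] {
			for _, id := range refs {
				if _, ok := deleted[id]; ok {
					affected = append(affected, v)
					break
				}
			}
		}
		p.mtx.RUnlock()
		if len(affected) == 0 {
			continue // unaffected name: the write lock is never taken
		}

		// Second pass, one write lock per affected name: rebuild the
		// affected postings lists without the deleted refs.
		p.mtx.Lock()
		for _, v := range affected {
			repl := make([]SeriesRef, 0, len(p.m[n][v]))
			for _, id := range p.m[n][v] {
				if _, ok := deleted[id]; !ok {
					repl = append(repl, id)
				}
			}
			if len(repl) > 0 {
				p.m[n][v] = repl
			} else {
				delete(p.m[n], v)
			}
		}
		if len(p.m[n]) == 0 {
			delete(p.m, n)
		}
		p.mtx.Unlock()
	}
}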

I'm also adding a benchmark, which I ran with -benchtime=10x: it's quite difficult to get meaningful results here because each time the operation is called, the benchmarked entity changes.

The benchmark runs some read operations that make the Lock() operation wait, which is the scenario I'm trying to mitigate: in a downstream project we've seen profiles waiting continuously on this Lock() call.
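
The comparison tables below are benchstat output; assuming a baseline run on main saved to main.txt, an invocation along these lines should reproduce them (the benchmark name and -count=6 are inferred from the output, the rest of the flags are assumptions):

go test -run '^$' -bench 'BenchmarkMemPostings_Delete' -benchtime=10x -count=6 ./tsdb/index/ | tee new.txt
benchstat main.txt new.txt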

goos: darwin
goarch: arm64
pkg: github.com/prometheus/prometheus/tsdb/index
                                            │     main     │                 new                 │
                                            │    sec/op    │    sec/op     vs base               │
MemPostings_Delete/refs=1/readers=0-12         141.2m ± 5%   157.2m ±  1%  +11.29% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=1-12        1152.1m ± 1%   205.7m ±  5%  -82.14% (p=0.002 n=6)
MemPostings_Delete/refs=1/readers=10-12        10.424 ± 3%    1.149 ± 45%  -88.98% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=0-12       173.8m ± 5%   202.2m ± 13%  +16.36% (p=0.004 n=6)
MemPostings_Delete/refs=100/readers=1-12      1203.4m ± 1%   226.8m ±  5%  -81.15% (p=0.002 n=6)
MemPostings_Delete/refs=100/readers=10-12      10.791 ± 2%    1.164 ± 49%  -89.21% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=0-12     198.3m ± 7%   176.3m ±  2%  -11.08% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=1-12    1246.2m ± 2%   241.1m ±  4%  -80.66% (p=0.002 n=6)
MemPostings_Delete/refs=10000/readers=10-12    10.571 ± 3%    1.122 ± 49%  -89.39% (p=0.002 n=6)
geomean                                         1.292        357.2m        -72.34%

I haven't found the reason for such a high variance in some of the new implementation's results.

I only ran memory benchmarks with readers=0, and the results are very similar:

                                           │   main_mem    │                new_mem                │
                                           │     B/op      │     B/op       vs base                │
MemPostings_Delete/refs=1/readers=0-12       28.17Mi ± ∞ ¹   27.65Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=100/readers=0-12     31.22Mi ± ∞ ¹   30.70Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=10000/readers=0-12   30.37Mi ± ∞ ¹   29.84Mi ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                      29.89Mi         29.37Mi        -1.74%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                           │   main_mem   │                new_mem                │
                                           │  allocs/op   │  allocs/op    vs base                 │
MemPostings_Delete/refs=1/readers=0-12       36.000 ± ∞ ¹    8.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=100/readers=0-12      248.0 ± ∞ ¹    221.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
MemPostings_Delete/refs=10000/readers=0-12   20.18k ± ∞ ¹   20.15k ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                       564.7          329.0        -41.74%

@colega colega changed the title from MemPostings: reduce locking/unlocking to MemPostings.Delete(): reduce locking/unlocking on Jan 2, 2024
@colega colega closed this May 16, 2024
@colega colega reopened this Jun 6, 2024
@colega colega marked this pull request as ready for review June 6, 2024 14:29
@colega colega requested a review from jesusvazquez as a code owner June 6, 2024 14:29
@colega
Contributor Author
colega commented Jun 6, 2024

I've recently seen a head GC taking 3h on a downstream project, and it looked like possible lock contention, so I thought it would be interesting to revive this.

I have no idea how I could benchmark it. It looks like it would only have an impact under query pressure.

repl := make([]storage.SeriesRef, 0, len(p.m[n][l]))

// Only take the write lock when there's actually something to write.
p.mtx.RUnlock()
Member


Unlocking here means another goroutine can jump in and alter p.m, which is unsafe because we are part-way through for n := range p.m on line 297.
I suspect this is why the original code copied the keys and values.

Contributor Author
@colega colega Jun 8, 2024


Thanks, good point. Reverted to copying the keys & values into their slices in b68c9c7.

Also added a proper estimation of the slice sizes to avoid resizing them.

@colega colega force-pushed the reduce-locking-on-mempostings-delete branch from 296ed0e to 7fe831b on June 8, 2024 15:27
colega added 3 commits June 9, 2024 12:06
MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the
series that have churned.

I'd assume that many label values aren't affected by that churn at all,
so it doesn't make sense to touch the lock while checking them.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega colega force-pushed the reduce-locking-on-mempostings-delete branch from 7fe831b to 9da7d70 on June 9, 2024 10:06
colega added 2 commits June 9, 2024 18:13
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega
Contributor Author
colega commented Jun 9, 2024

I've optimized the code further so that it now locks at most once per label name in the storage, pushed a benchmark, and updated the PR description.

@colega
Contributor Author
colega commented Jun 9, 2024

FWIW, I ran the benchmark with 1 iteration removing 100k series with 100 concurrent readers: it took 265s on main and 1.7s on this new implementation.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@jesusvazquez
Member

/prombench main

@prombot
Contributor
prombot commented Jun 10, 2024

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-13286 and main

After successful deployment, the benchmarking results can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

Member
@jesusvazquez jesusvazquez left a comment


I've reviewed these changes live with @colega and they make sense to me.

I think the improvements will impact not only compaction times but also writes, so I'm running Prombench to see whether our existing live benchmark captures any of this.

Nice work

Member
@bboreham bboreham left a comment


The code change looks fine; I spent some time trying to explain it better.

I wonder if the whole operation might be more efficient if we started with the series to be deleted, and collected together each name/value pair affected. I guess it depends on the relative size of the Postings index and the set of series to be deleted. But that would be another PR entirely.

  found := false
- for _, id := range p.m[n][l] {
+ for _, id := range p.m[n][vals[i]] {
Member


Note for after this PR: we could copy p.m[n][vals[i]] in the loop earlier (line 313) and save hash lookups.
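
(A hypothetical sketch of that note, not the PR's code or the actual change in 32f5c8d: while collecting the values of one label name, also keep the postings slices, so later loops never need to look up p.m[n][vals[i]] again.)

package sketch

// collectValsAndLists keeps each value's postings slice alongside the value
// itself, saving the repeated two-level map lookups later on. The map shape
// and ref type here are illustrative, not Prometheus code.
func collectValsAndLists(m map[string][]uint64) (vals []string, lists [][]uint64) {
	vals = make([]string, 0, len(m))
	lists = make([][]uint64, 0, len(m))
	for v, refs := range m {
		vals = append(vals, v)
		lists = append(lists, refs)
	}
	return vals, lists
}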

Contributor Author


Since I was pushing the comment changes, I also changed this: 32f5c8d

colega added 2 commits June 10, 2024 13:34
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega
Contributor Author
colega commented Jun 10, 2024

I wonder if the whole operation might be more efficient if we started with the series to be deleted, and collected together each name/value pair affected. I guess it depends on the relative size of the Postings index and the set of series to be deleted. But that would be another PR entirely.

Right, I think we can just hook into the call made to seriesLifecycleCallback:

s.seriesLifecycleCallback.PostDeletion(seriesSet)

It shouldn't be an issue to split this Delete call into multiple calls.
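
(For illustration, a rough sketch of that alternative direction, starting from the deleted series' label sets and collecting the affected name/value pairs up front; the types and names below are assumptions, not Prometheus code.)

package sketch

// labelPair is a stand-in for a label name/value pair.
type labelPair struct{ name, value string }

// affectedPairs collects every name/value pair referenced by the deleted
// series, so a deletion pass would only need to visit those postings lists
// instead of scanning the whole index.
func affectedPairs(deletedSeriesLabels [][]labelPair) map[labelPair]struct{} {
	affected := make(map[labelPair]struct{}, len(deletedSeriesLabels))
	for _, ls := range deletedSeriesLabels {
		for _, l := range ls {
			affected[l] = struct{}{}
		}
	}
	return affected
}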

@jesusvazquez jesusvazquez merged commit 2dc177d into prometheus:main Jun 10, 2024
25 checks passed
@jesusvazquez
Member

/prombench cancel

@prombot
Contributor
prombot commented Jun 10, 2024

Benchmark cancel is in progress.

colega added a commit to grafana/mimir that referenced this pull request Jun 10, 2024
* Update mimir-prometheus to incl MemPostings improvements

Updating to bring changes from:
- prometheus/prometheus#13286
- prometheus/prometheus#14286

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Update CHANGELOG.md

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

---------

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
@colega colega deleted the reduce-locking-on-mempostings-delete branch June 17, 2024 11:39
colega added a commit to colega/prometheus that referenced this pull request Oct 29, 2024
This introduces back some unlocking that was removed in prometheus#13286 but in a
more balanced way, as suggested by @pracucci.

For TSDBs with a lot of churn, Delete() can take a couple of seconds,
and while it's holding the mutex, reads and writes are blocked waiting
for that mutex, increasing the number of connections handled and memory
usage.

This implementation pauses every 4K labels processed (note that also
compared to prometheus#13286 we're not processing all the label-values anymore,
but only the affected ones, because of prometheus#14307), makes sure that it's
possible to get the read lock, and waits for a few milliseconds more.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
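
(A rough sketch of the pausing idea described in this commit message; the batch size, sleep duration, and function shape are illustrative assumptions, not the follow-up PR's actual code.)

package sketch

import (
	"sync"
	"time"
)

// deleteWithPauses processes the affected postings under the write lock, but
// every few thousand entries it briefly releases the lock so blocked readers
// and writers can make progress.
func deleteWithPauses(mtx *sync.RWMutex, affected []string, rewrite func(string)) {
	const pauseEvery = 4096 // "pauses every 4K labels processed"

	mtx.Lock()
	for i, key := range affected {
		rewrite(key) // hypothetical callback that drops the deleted refs for this key
		if (i+1)%pauseEvery == 0 {
			mtx.Unlock()
			time.Sleep(time.Millisecond) // give readers a chance to take the lock
			mtx.Lock()
		}
	}
	mtx.Unlock()
}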
jesusvazquez pushed a commit that referenced this pull request Nov 5, 2024
…15242)
juliusv pushed a commit that referenced this pull request Nov 5, 2024
…15242)
verejoel pushed a commit to verejoel/prometheus that referenced this pull request Nov 27, 2024
…rometheus#15242)
verejoel pushed a commit to open-ch/prometheus that referenced this pull request Dec 17, 2024
…rometheus#15242)
yuchen-db pushed a commit to yuchen-db/prometheus that referenced this pull request Apr 30, 2025
…rometheus#15242)