MemPostings.Delete(): reduce locking/unlocking #13286
Conversation
I've recently seen a head GC take 3h on a downstream project, and it looked like possible lock contention, so I thought it would be interesting to recover this. I have no idea how I could benchmark it; it looks like it would only have an impact under query pressure.
tsdb/index/postings.go (outdated)

repl := make([]storage.SeriesRef, 0, len(p.m[n][l]))

// Only take the write lock when there's actually something to write.
p.mtx.RUnlock()
Unlocking here means another goroutine can jump in and alter p.m, which is unsafe because we are part-way through for n := range p.m on line 297. I suspect this is why the original code copied the keys and values.
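To illustrate the hazard (with simplified, assumed types; this is not the actual Prometheus code): once the read lock is released, another goroutine can take the write lock and alter the map while this goroutine is still part-way through ranging over it.

package sketch

import "sync"

type memPostings struct {
    mtx sync.RWMutex
    m   map[string]map[string][]uint64 // label name -> label value -> series refs
}

// unsafeWalk shows the problematic pattern: releasing the read lock inside
// `for n := range p.m` lets writers mutate p.m mid-iteration.
func (p *memPostings) unsafeWalk(work func(name string)) {
    p.mtx.RLock()
    for n := range p.m { // iteration over p.m starts here...
        p.mtx.RUnlock() // ...but dropping the lock lets another goroutine alter p.m...
        work(n)
        p.mtx.RLock() // ...so resuming the range afterwards is unsafe.
    }
    p.mtx.RUnlock()
}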
Thanks, good point. Reverted the copying of keys & values to their slices in b68c9c7, and added proper estimation of the slice sizes to avoid resizing them.
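Purely as an illustration of the slice-size estimation (simplified, assumed types; not the actual commit b68c9c7): pre-size the copied slice from the map length so that appends inside the loop never reallocate.

package sketch

import "sync"

type memPostings struct {
    mtx sync.RWMutex
    m   map[string]map[string][]uint64 // label name -> label value -> series refs
}

// copyKeys snapshots the label names under the read lock, with the slice
// pre-sized to len(p.m) so append never has to grow it.
func (p *memPostings) copyKeys() []string {
    p.mtx.RLock()
    defer p.mtx.RUnlock()
    keys := make([]string, 0, len(p.m))
    for n := range p.m {
        keys = append(keys, n)
    }
    return keys
}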
Force-pushed from 296ed0e to 7fe831b
MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the series that have churned. I'd assume that many label values aren't affected by that churn at all, so it doesn't make sense to touch the lock while checking them. Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Force-pushed from 7fe831b to 9da7d70
I've optimized the code further: it now locks at most once per label name in the storage. I've pushed a benchmark and updated the PR description.
FWIW, I ran the benchmark with 1 iteration removing 100k series with 100 concurrent readers: it took 265s on
/prombench main
I've reviewed these changes live with @colega and they make sense to me.
I think the improvements will impact not only compaction times but also writes, so I'm running Prombench to see whether our existing live benchmark captures any of this.
Nice work.
The code change looks fine; I spent some time trying to explain it better.
I wonder if the whole operation might be more efficient if we started with the series to be deleted, and collected together each name/value pair affected. I guess it depends on the relative size of the Postings index and the set of series to be deleted. But that would be another PR entirely.
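For what it's worth, a hypothetical sketch of that alternative, with an assumed labelsForSeries helper and labelPair type that are not part of MemPostings: start from the deleted series and collect the affected name/value pairs, so a later pass only touches those postings lists. Whether this wins depends, as noted, on the relative sizes involved.

package sketch

// labelPair is an assumed helper type for this sketch.
type labelPair struct{ Name, Value string }

// affectedPairs collects every label name/value pair used by at least one
// deleted series, so a later pass can restrict itself to those postings lists
// instead of scanning the whole index.
func affectedPairs(deleted map[uint64]struct{}, labelsForSeries func(ref uint64) []labelPair) map[labelPair]struct{} {
    affected := make(map[labelPair]struct{}, len(deleted))
    for ref := range deleted {
        for _, lp := range labelsForSeries(ref) {
            affected[lp] = struct{}{}
        }
    }
    return affected
}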
tsdb/index/postings.go (outdated)

  found := false
- for _, id := range p.m[n][l] {
+ for _, id := range p.m[n][vals[i]] {
Note for after this PR: we could copy p.m[n][vals[i]] in the loop earlier (line 313) and save hash lookups.
Since I was pushing the comment changes, I also changed this: 32f5c8d
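A minimal illustration of that note (assumed shapes, not the actual change in 32f5c8d): look the postings list up once and pass the slice around, so the hot loop does no further map (hash) lookups.

package sketch

// anyDeleted reports whether any series in list is marked as deleted.
// The caller does `list := p.m[n][vals[i]]` a single time and passes the
// slice in, so the loop body performs no repeated map lookups.
func anyDeleted(list []uint64, deleted map[uint64]struct{}) bool {
    for _, ref := range list {
        if _, ok := deleted[ref]; ok {
            return true
        }
    }
    return false
}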
Right, I think we can just hook into the call made at line 1966 in 08621be. It shouldn't be an issue to split this.
/prombench cancel
Benchmark cancel is in progress.
Update mimir-prometheus to include MemPostings improvements, bringing in changes from prometheus/prometheus#13286 and prometheus/prometheus#14286; also update CHANGELOG.md. Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
This introduces back some unlocking that was removed in prometheus#13286, but in a more balanced way, as suggested by @pracucci. For TSDBs with a lot of churn, Delete() can take a couple of seconds, and while it's holding the mutex, reads and writes are blocked waiting for it, increasing the number of connections handled and memory usage. This implementation pauses every 4K labels processed (note that, compared to prometheus#13286, we no longer process all the label values, only the affected ones, because of prometheus#14307), makes sure that it's possible to get the read lock, and waits for a few milliseconds more. Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com> Co-authored-by: Marco Pracucci <marco@pracucci.com>
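A rough sketch of the pausing pattern that commit message describes, with assumed constants, names and timings (the real change differs in detail): every few thousand processed labels, release the write lock, check that the read lock can be taken so queued readers get a chance to run, wait a few milliseconds, and take the write lock again.

package sketch

import (
    "sync"
    "time"
)

const pauseEvery = 4096 // assumed batch size ("every 4K labels processed")

type memPostings struct {
    mtx sync.RWMutex
    m   map[string]map[string][]uint64
}

// deleteWithPauses processes the affected label names under the write lock,
// but periodically steps aside so blocked readers and writers can run.
func (p *memPostings) deleteWithPauses(affected []string, deleteName func(name string)) {
    p.mtx.Lock()
    for i, name := range affected {
        deleteName(name) // drop the deleted series refs for this label name

        if (i+1)%pauseEvery == 0 && i+1 < len(affected) {
            p.mtx.Unlock()
            // Taking and releasing the read lock confirms the write lock was
            // fully released and gives waiting readers a chance to run.
            p.mtx.RLock()
            p.mtx.RUnlock()
            time.Sleep(time.Millisecond) // "waits for a few milliseconds more"
            p.mtx.Lock()
        }
    }
    p.mtx.Unlock()
}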
(The description has been updated since I created this PR)
MemPostings.Delete is called from Head.gc(), i.e. it gets the IDs of the series that have churned. With the current implementation, it tries to acquire the Lock() for each label/value combination, which can be quite expensive for tenants who have lots of those combinations. This changes the implementation to call Lock() only once per label name (not per value), and only if that name is affected by the series churn.

I'm also adding a benchmark, which I ran with -benchtime=10x: it's quite difficult to get a meaningful result here, as the benchmarked entity changes every time the operation is called. The benchmark runs some read operations that would make the Lock() call wait, which is the scenario I'm trying to mitigate: in a downstream project we've seen profiles waiting continuously on this Lock() call. I haven't found the reason for the high variance in some of the new implementation's results.

I only ran memory benchmarks with readers=0, and the results are very similar.
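To make the locking pattern described above concrete, here is a simplified sketch with assumed types; it mirrors the idea, not the actual MemPostings implementation: affected values are discovered under the read lock, and the write lock is taken at most once per label name, and only when that name actually has something to delete.

package sketch

import "sync"

type memPostings struct {
    mtx sync.RWMutex
    m   map[string]map[string][]uint64 // label name -> label value -> series refs
}

func (p *memPostings) Delete(deleted map[uint64]struct{}) {
    // Snapshot the label names under the read lock.
    p.mtx.RLock()
    names := make([]string, 0, len(p.m))
    for n := range p.m {
        names = append(names, n)
    }
    p.mtx.RUnlock()

    for _, n := range names {
        // Find the affected values while holding only the read lock.
        p.mtx.RLock()
        var affected []string
        for v, refs := range p.m[n] {
            for _, ref := range refs {
                if _, ok := deleted[ref]; ok {
                    affected = append(affected, v)
                    break
                }
            }
        }
        p.mtx.RUnlock()

        if len(affected) == 0 {
            continue // unaffected label name: the write lock is never taken
        }

        // Take the write lock once for this label name and rewrite its lists.
        p.mtx.Lock()
        for _, v := range affected {
            repl := make([]uint64, 0, len(p.m[n][v]))
            for _, ref := range p.m[n][v] {
                if _, ok := deleted[ref]; !ok {
                    repl = append(repl, ref)
                }
            }
            if len(repl) > 0 {
                p.m[n][v] = repl
            } else {
                delete(p.m[n], v)
            }
        }
        if len(p.m[n]) == 0 {
            delete(p.m, n)
        }
        p.mtx.Unlock()
    }
}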