Reload all scrape pools concurrently by prymitive · Pull Request #16595 · prometheus/prometheus · GitHub

Reload all scrape pools concurrently #16595


Merged
merged 2 commits into prometheus:main on Jun 9, 2025

Conversation

prymitive
Contributor

At the moment all scrape pools that need to be reloaded are reloaded one by one, and while the reloads are ongoing mtxScrape is locked. For each pool that's being reloaded we need to wait until all targets are updated. This whole process can take a while, and the more scrape pools there are to reload the longer it takes. At the same time all pools are independent and there's no real reason to reload them one by one. Reload each pool in a separate goroutine so we finish the config reload as soon as possible and unlock mtxScrape.
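As a rough, self-contained sketch of the idea (toyPool and its sleep-based reload are illustrative stand-ins, not the actual scrape pool code): reloading each pool in its own goroutine and waiting on a sync.WaitGroup makes the total reload time roughly the slowest pool instead of the sum over all pools.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyPool stands in for a scrape pool; reload just sleeps to simulate
// waiting for in-flight scrapes to finish before the pool restarts.
type toyPool struct {
	name     string
	stopTime time.Duration
}

func (p *toyPool) reload() { time.Sleep(p.stopTime) }

func main() {
	pools := []*toyPool{
		{"job-a", 2 * time.Second},
		{"job-b", 3 * time.Second},
		{"job-c", 1 * time.Second},
	}

	start := time.Now()
	var wg sync.WaitGroup
	for _, p := range pools {
		wg.Add(1)
		go func(p *toyPool) {
			defer wg.Done()
			p.reload()
		}(p)
	}
	wg.Wait() // total ≈ max(stopTime) instead of the sum when reloading one by one
	fmt.Println("concurrent reload took", time.Since(start))
}
```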


Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@prymitive
Contributor Author

A bit more context: while mtxScrape is locked you can't scrape Prometheus itself, so if the reload takes too long Prometheus appears to be down because it's not possible to scrape it.

Member
@machine424 machine424 left a comment

Thanks for this, code LGTM.

But I'd be interested in some numbers if you could share: for how many pools/targets do you need to wait, and for how long? Any profiling? Which operations are the heaviest?

Given that stop() and reload() themselves spawn goroutines, I don't know at which scale this could be problematic...

@prymitive
Contributor Author

But I'd be interested in some numbers if you could share: for how many pools/targets do you need to wait, and for how long? Any profiling? Which operations are the heaviest?

I noticed this issue because scrapes against the affected instance were failing, and it turned out that the config reload was taking almost 6 minutes there:

time=2025-05-12T17:47:36.315Z level=INFO source=main.go:1485 msg="Completed loading of configuration file" db_storage=1.43µs remote_storage=1.32µs web_handler=640ns query_engine=710ns scrape=5m45.078648964s scrape_sd=130.431446ms notify=265.544µs notify_sd=11.65µs rules=525.933409ms tracing=5.63µs filename=/etc/prometheus/config.yaml totalDuration=5m45.735612356s

When I triggered a similar config change after applying this patch it all went down to 19 seconds:

time=2025-05-14T09:31:43.424Z level=INFO source=main.go:1488 msg="Completed loading of configuration file" db_storage=1.32µs remote_storage=1.24µs web_handler=560ns query_engine=1.09µs scrape=18.463130139s scrape_sd=133.919268ms notify=219.733µs notify_sd=10.83µs rules=462.854514ms tracing=6.28µs filename=/etc/prometheus/config.yaml totalDuration=19.060452528s

This particular issue was with a config file that had 44 scrape jobs defined, each with a different number of targets (3 to 30). They all share the same pattern: the targets are read from a JSON file, and there's also a relabel_configs rule that filters them with a regex.

- file_sd_configs:
  - files:
    - targets.json
  job_name: ...
  relabel_configs:
  - action: keep
    regex: instance1|instance2|instance3|...
    source_labels:
    - instance

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@machine424
Member

Thanks for the context.

That's a good improvement. For such a config, 6 minutes is insane, but 19 seconds still feels like a lot.
I'll try to reproduce this to see what could be slowing the process down; what kind of changes happen before the reload?

A bit more context: while mtxScrape is locked you can't scrape Prometheus itself, so if the reload takes too long Prometheus appears to be down because it's not possible to scrape it.

That’s interesting. I would expect only the target/scraping related routes to be blocked, but not /metrics. Are all other routes blocked during that time?

@prymitive
Contributor Author

what kind of changes happen before the reload?

It's different each time, but I could trigger it by simply re-ordering elements of the regex: instance1|instance2|instance3|... -> regex: instance1|instance3|instance2...

That’s interesting. I would expect only the target/scraping related routes to be blocked, but not /metrics. Are all other routes blocked during that time?

Some of the exposed metrics describe scrape pools, and there's a gatherer that gets the list of active targets (a TargetsAll call), which is behind a lock:

https://github.com/prometheus/prometheus/blob/main/scrape/metrics.go#L300-L302
https://github.com/prometheus/prometheus/blob/main/scrape/metrics.go#L323
https://github.com/prometheus/prometheus/blob/main/scrape/manager.go#L340
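Put differently, a simplified sketch of that interaction (the manager type and its fields here are illustrative stand-ins, not the exact code behind the links above): the gatherer's TargetsAll call needs the same mutex that ApplyConfig holds for the whole reload, so a /metrics scrape queues up behind it.

```go
package main

import "sync"

// manager is a simplified stand-in for scrape.Manager.
type manager struct {
	mtxScrape sync.Mutex
	targets   map[string][]string // pool name -> active targets (stand-in type)
}

// TargetsAll is what the metrics gatherer ends up calling while building the
// /metrics response; it needs the same lock as ApplyConfig.
func (m *manager) TargetsAll() map[string][]string {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	return m.targets
}

// ApplyConfig holds mtxScrape while every scrape pool is reloaded, so any
// concurrent TargetsAll call (and hence the /metrics scrape) blocks until
// the reload finishes.
func (m *manager) ApplyConfig(reloadAllPools func()) {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	reloadAllPools()
}

func main() {
	m := &manager{targets: map[string][]string{}}
	m.ApplyConfig(func() { /* stop/reload every pool here */ })
	_ = m.TargetsAll()
}
```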

@prymitive
Contributor Author

Also: ApplyConfig() is a big function that does a lot. We could make the locking there more granular, since we don't need the lock to be held the whole time, but doing so would make the exported metrics updates less atomic. Right now you get metrics either from before the reload or from after it, so everything in a metrics response is consistent. If we made locking inside ApplyConfig more granular, then scraping Prometheus while it's running ApplyConfig might return metrics from scrape pools that are being removed, etc. That seemed risky to me, so I didn't do it here.
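For comparison, a purely hypothetical sketch of the more granular locking described above, reusing the simplified manager stand-in from the previous sketch (this is the variant deliberately not done in this PR): the lock is only held while planning the reloads, and the reloads run unlocked, which unblocks /metrics sooner but lets a concurrent scrape observe pools that are mid-removal or mid-reload.

```go
// Hypothetical variant, illustration only: not what this PR does.
func (m *manager) applyConfigGranular(planReloads func() []func()) {
	m.mtxScrape.Lock()
	reloads := planReloads() // decide what to stop/start/reload while holding the lock
	m.mtxScrape.Unlock()

	for _, reload := range reloads {
		reload() // runs unlocked: /metrics is not blocked, but may see inconsistent pool state
	}
}
```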

@prymitive
Contributor Author

I could make the locks more granular, which would make ApplyConfig() less blocking, but at the cost of /metrics responses potentially showing not-100%-consistent data if a scrape happens halfway through the ApplyConfig run. What do you think @machine424?

@machine424
Member

Thanks for the details.

Yes, relaxing the present "consistency" model will certainly come at a cost, but blocking the endpoint for 19 seconds isn't ideal either.

I was able to reproduce this and got 1m48s on main. If you haven't done that yet, before we throw concurrency at it, I'd check where exactly we spend that time and whether we could make things more abrupt/avoid some steps...

@prymitive
Contributor Author

I was able to reproduce this and got 1m48s on main. If you haven't done that yet, before we throw concurrency at it, I'd check where exactly we spend that time and whether we could make things more abrupt/avoid some steps...

There's always at least one unhealthy target in each pool, and that slows down each scrape pool's stop() when calling reload()->restartLoops(), as we need to wait for all scrapes to complete before a scrape pool is fully stopped. This adds a few seconds of delay when stopping each pool, and with enough pools, stopped one by one, it all adds up to a big delay. Reloading all pools at the same time means we just wait for the slowest one.
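To put rough numbers on that (back-of-the-envelope only: the ~8 s per-pool wait is an assumption; the 44 jobs and the ~5m45s / ~19 s totals come from the logs earlier in the thread), the serial cost is roughly the sum of the per-pool waits while the concurrent cost is roughly the maximum:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const pools = 44               // scrape jobs in the affected config
	perPoolWait := 8 * time.Second // assumed average time stop() waits for in-flight scrapes

	serial := time.Duration(pools) * perPoolWait // one by one: the waits add up
	concurrent := perPoolWait                    // all at once: roughly the slowest pool

	fmt.Println("serial ≈", serial)         // ≈ 5m52s, same ballpark as the observed 5m45s
	fmt.Println("concurrent ≈", concurrent) // observed total after the patch was ~19s
}
```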

@prymitive
Contributor Author

Anything else I can / should do here?

@machine424
Member

As I mentioned above, I agree that the change will help speed things up. Again, before opting for concurrency, I wanted to know (since it's not clear if you've verified this) whether we can avoid that delay altogether.

It turns out that, prior to #7752, there was no wait: the reload was abrupt, with confusing side effects. If concurrency isn't enough, we could always consider restoring that behavior, without the side effects in #7097, behind a flag (we could use something like https://pkg.go.dev/context#CancelCauseFunc to identify a cancel that's due to a reload).
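A quick, hypothetical sketch of the CancelCauseFunc idea (errReload and the surrounding logic are illustrative, not existing Prometheus code): cancelling a scrape loop's context with a dedicated cause lets the loop tell "stopped for a reload" apart from a normal shutdown, so an abrupt stop path could be taken only for reloads.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errReload is a hypothetical sentinel meaning "stopped because of a config reload".
var errReload = errors.New("scrape loop stopped for reload")

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// On a config reload, cancel with the dedicated cause instead of a plain cancel.
	cancel(errReload)

	<-ctx.Done()
	if errors.Is(context.Cause(ctx), errReload) {
		// A loop could take the abrupt (pre-#7752) stop path here and keep
		// the graceful wait for every other kind of cancellation.
		fmt.Println("stopped due to reload:", context.Cause(ctx))
	}
}
```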

Also, with a bit more logic, we could be selective about which changes actually require restarting a scrape loop; for post-scrape operations like metrics relabeling (which don't really influence the scrape itself), we could just update the existing loop with the new config...

That said, let's go with this change for now. Since we'll be spawning a goroutine per target/loop, have you monitored how it behaves on some instances with "many" targets to reload?

(Regarding the /metrics and other target-related API stalling, there was an attempt to address that in #8104, reverted in #8176.) Anyway, we'll need to either relax consistency or keep an accessible copy of the needed data and swap it; both options come at a cost, so let's continue that discussion somewhere else.

@prymitive
Contributor Author

have you monitored how it behaves on some instances with "many" targets to reload?

To an extent. This PR was raised to fix an issue where we would get alerts about a few busy Prometheus instances being down because they were timing out scrape requests and so were marked as down. We've been running with this patch for a while now, and since deploying it we've stopped getting these alerts while not seeing any other issues.

Member
@machine424 machine424 left a comment

Thanks, let's give this a try then.

@machine424 machine424 merged commit c528293 into prometheus:main Jun 9, 2025
27 checks passed