Reload all scrape pools concurrently by prymitive · Pull Request #16595 · prometheus/prometheus · GitHub

Reload all scrape pools concurrently #16595


Merged
merged 2 commits into prometheus:main on Jun 9, 2025

Conversation

prymitive
Contributor

At the moment all scrape pools that need to be reloaded are reloaded one by one, and while the reloads are ongoing mtxScrape is locked. For each pool that's being reloaded we need to wait until all targets are updated. This whole process can take a while, and the more scrape pools there are to reload the longer it takes. At the same time all pools are independent and there's no real reason to reload them one by one. Reload each pool in a separate goroutine so we finish the config reload as soon as possible and unlock mtxScrape.
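As a rough, self-contained sketch of the idea (toyPool and its sleep-based reload are illustrative stand-ins, not the actual scrape pool code): reloading each pool in its own goroutine and waiting on a sync.WaitGroup makes the total reload time roughly the slowest pool instead of the sum over all pools.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyPool stands in for a scrape pool; reload just sleeps to simulate
// waiting for in-flight scrapes to finish before the pool restarts.
type toyPool struct {
	name     string
	stopTime time.Duration
}

func (p *toyPool) reload() { time.Sleep(p.stopTime) }

func main() {
	pools := []*toyPool{
		{"job-a", 2 * time.Second},
		{"job-b", 3 * time.Second},
		{"job-c", 1 * time.Second},
	}

	start := time.Now()
	var wg sync.WaitGroup
	for _, p := range pools {
		wg.Add(1)
		go func(p *toyPool) {
			defer wg.Done()
			p.reload()
		}(p)
	}
	wg.Wait() // total ≈ max(stopTime) instead of the sum when reloading one by one
	fmt.Println("concurrent reload took", time.Since(start))
}
```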


Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@prymitive
Contributor Author

A bit more context: while mtxScrape is locked you can't scrape Prometheus itself, so if the reload takes too long Prometheus appears to be down because it's not possible to scrape it.

Member
@machine424 machine424 left a comment

Thanks for this, code LGTM.

But I'd be interested in some numbers if you could share: for how many pools/targets do you need to wait, and for how long? Any profiling? Which operations are the heaviest?

Given that stop() and reload() themselves spawn goroutines, I don't know at which scale this could be problematic...

@prymitive
Contributor Author

But I'd be interested in some numbers if you could share: for how many pools/targets do you need to wait, and for how long? Any profiling? Which operations are the heaviest?

I noticed this issue because scrapes against the affected instance were failing, and it turned out that the config reload was taking almost 6 minutes there:

time=2025-05-12T17:47:36.315Z level=INFO source=main.go:1485 msg="Completed loading of configuration file" db_storage=1.43µs remote_storage=1.32µs web_handler=640ns query_engine=710ns scrape=5m45.078648964s scrape_sd=130.431446ms notify=265.544µs notify_sd=11.65µs rules=525.933409ms tracing=5.63µs filename=/etc/prometheus/config.yaml totalDuration=5m45.735612356s

When I triggered a similar config change after applying this patch it all went down to 19 seconds:

time=2025-05-14T09:31:43.424Z level=INFO source=main.go:1488 msg="Completed loading of configuration file" db_storage=1.32µs remote_storage=1.24µs web_handler=560ns query_engine=1.09µs scrape=18.463130139s scrape_sd=133.919268ms notify=219.733µs notify_sd=10.83µs rules=462.854514ms tracing=6.28µs filename=/etc/prometheus/config.yaml totalDuration=19.060452528s

This particular issue was with a config file that had 44 scrape jobs defined, each with a different number of targets (3 to 30). They all share the same pattern: the targets are read from a JSON file, and there's also a relabel_configs rule that filters them with a regex.

- file_sd_configs:
  - files:
    - targets.json
  job_name: ...
  relabel_configs:
  - action: keep
    regex: instance1|instance2|instance3|...
    source_labels:
    - instance

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@machine424
Member

Thanks for the context.

That's a good improvement. For such a config, 6 minutes is insane, but 19 seconds still feels like a lot.
I'll try to reproduce this to see what could be slowing the process down; what kind of changes happen before the reload?

A bit more context: while mtxScrape is locked you can't scrape Prometheus itself, so if the reload takes too long Prometheus appears to be down because it's not possible to scrape it.

That’s interesting. I would expect only the target/scraping related routes to be blocked, but not /metrics. Are all other routes blocked during that time?

@prymitive
Contributor Author

what kind of changes happen before the reload?

It's different each time, but I could trigger it by simply re-ordering elements of the regex: instance1|instance2|instance3|... -> regex: instance1|instance3|instance2...

That’s interesting. I would expect only the target/scraping related routes to be blocked, but not /metrics. Are all other routes blocked during that time?

Some of the exposed metrics describe scrape pools, and there's a gatherer that gets the list of active targets (a TargetsAll call), which is behind a lock:

https://github.com/prometheus/prometheus/blob/main/scrape/metrics.go#L300-L302
https://github.com/prometheus/prometheus/blob/main/scrape/metrics.go#L323
https://github.com/prometheus/prometheus/blob/main/scrape/manager.go#L340
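Put differently, a simplified sketch of that interaction (the manager type and its fields here are illustrative stand-ins, not the exact code behind the links above): the gatherer's TargetsAll call needs the same mutex that ApplyConfig holds for the whole reload, so a /metrics scrape queues up behind it.

```go
package main

import "sync"

// manager is a simplified stand-in for scrape.Manager.
type manager struct {
	mtxScrape sync.Mutex
	targets   map[string][]string // pool name -> active targets (stand-in type)
}

// TargetsAll is what the metrics gatherer ends up calling while building the
// /metrics response; it needs the same lock as ApplyConfig.
func (m *manager) TargetsAll() map[string][]string {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	return m.targets
}

// ApplyConfig holds mtxScrape while every scrape pool is reloaded, so any
// concurrent TargetsAll call (and hence the /metrics scrape) blocks until
// the reload finishes.
func (m *manager) ApplyConfig(reloadAllPools func()) {
	m.mtxScrape.Lock()
	defer m.mtxScrape.Unlock()
	reloadAllPools()
}

func main() {
	m := &manager{targets: map[string][]string{}}
	m.ApplyConfig(func() { /* stop/reload every pool here */ })
	_ = m.TargetsAll()
}
```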

@prymitive
Contributor Author

Also: ApplyConfig() is a big function that does a lot. We could make the locking there more granular, since we don't need the lock to be held the whole time, but doing so would make the exported metrics updates less atomic. Right now you get metrics either from before the reload or from after it, so everything in a metrics response is consistent. If we made locking inside ApplyConfig more granular, then scraping Prometheus while it's running ApplyConfig might return metrics from scrape pools that are being removed, etc. That seemed risky to me, so I didn't do it here.
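For comparison, a purely hypothetical sketch of the more granular locking described above, reusing the simplified manager stand-in from the previous sketch (this is the variant deliberately not done in this PR): the lock is only held while planning the reloads, and the reloads run unlocked, which unblocks /metrics sooner but lets a concurrent scrape observe pools that are mid-removal or mid-reload.

```go
// Hypothetical variant, illustration only: not what this PR does.
func (m *manager) applyConfigGranular(planReloads func() []func()) {
	m.mtxScrape.Lock()
	reloads := planReloads() // decide what to stop/start/reload while holding the lock
	m.mtxScrape.Unlock()

	for _, reload := range reloads {
		reload() // runs unlocked: /metrics is not blocked, but may see inconsistent pool state
	}
}
```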

@prymitive
Contributor Author

I could make the locks more granular, which would make ApplyConfig() less blocking, but at the cost of /metrics responses potentially showing not-100%-consistent data if a scrape happens halfway through the ApplyConfig run. What do you think @machine424?

@machine424
Member

Thanks for the details.

Yes, relaxing the present "consistency" model will certainly come at a cost, but blocking the endpoint for 19 seconds isn't ideal either.

I was able to reproduce this and got 1m48s on main. If you haven't done that yet, before we throw concurrency at it, I'd check where exactly we spend that time and whether we could make things more abrupt/avoid some steps...

@prymitive
Contributor Author

I was able to reproduce this and got 1m48s on main. If you haven't done that yet, before we throw concurrency at it, I'd check where exactly we spend that time and whether we could make things more abrupt/avoid some steps...

There's always at least one unhealthy target in each pool, and that slows down each scrape pool's stop() when calling reload()->restartLoops(), as we need to wait for all scrapes to complete before a scrape pool is fully stopped. This adds a few seconds of delay when stopping each pool, and with enough pools, stopped one by one, it all adds up to a big delay. Reloading all pools at the same time means we just wait for the slowest one.
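To put rough numbers on that (back-of-the-envelope only: the ~8 s per-pool wait is an assumption; the 44 jobs and the ~5m45s / ~19 s totals come from the logs earlier in the thread), the serial cost is roughly the sum of the per-pool waits while the concurrent cost is roughly the maximum:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const pools = 44               // scrape jobs in the affected config
	perPoolWait := 8 * time.Second // assumed average time stop() waits for in-flight scrapes

	serial := time.Duration(pools) * perPoolWait // one by one: the waits add up
	concurrent := perPoolWait                    // all at once: roughly the slowest pool

	fmt.Println("serial ≈", serial)         // ≈ 5m52s, same ballpark as the observed 5m45s
	fmt.Println("concurrent ≈", concurrent) // observed total after the patch was ~19s
}
```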

@prymitive
Contributor Author

Anything else I can / should do here?

@machine424
Member

As I mentioned above, I agree that the change will help speed things up. Again, before opting for concurrency, I wanted to know (since it's not clear if you've verified this) whether we can avoid that delay altogether.

It turns out that, prior to #7752, there was no wait: the reload was abrupt, with confusing side effects. If concurrency isn't enough, we could always consider restoring that behavior, without the side effects in #7097, behind a flag (we could use something like https://pkg.go.dev/context#CancelCauseFunc to identify a cancel that's due to a reload).
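A quick, hypothetical sketch of the CancelCauseFunc idea (errReload and the surrounding logic are illustrative, not existing Prometheus code): cancelling a scrape loop's context with a dedicated cause lets the loop tell "stopped for a reload" apart from a normal shutdown, so an abrupt stop path could be taken only for reloads.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errReload is a hypothetical sentinel meaning "stopped because of a config reload".
var errReload = errors.New("scrape loop stopped for reload")

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// On a config reload, cancel with the dedicated cause instead of a plain cancel.
	cancel(errReload)

	<-ctx.Done()
	if errors.Is(context.Cause(ctx), errReload) {
		// A loop could take the abrupt (pre-#7752) stop path here and keep
		// the graceful wait for every other kind of cancellation.
		fmt.Println("stopped due to reload:", context.Cause(ctx))
	}
}
```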

Also, with a bit more logic, we could be selective about which changes actually require restarting a scrape loop; for post-scrape operations like metrics relabeling (which don't really influence the scrape itself), we could just update the existing loop with the new config...

That said, let's go with this change for now. Since we'll be spawning a goroutine per target/loop, have you monitored how it behaves on some instances with "many" targets to reload?

(Regarding the /metrics and other target-related API stalling, there was an attempt to address that in #8104, reverted in #8176.) Anyway, we'll need to either relax consistency or keep an accessible copy of the needed data and swap it; both options come at a cost, so let's continue that discussion somewhere else.

@prymitive
Contributor Author

have you monitored how it behaves on some instances with "many" targets to reload?

To an extent. This PR was raised to fix an issue where we would get alerts about a few busy Prometheus instances being down because they were timing out scrape requests and so were marked as down. We've been running with this patch for a while now, and since deploying it we've stopped getting these alerts while not seeing any other issues.

Member
@machine424 machine424 left a comment

Thanks, let's give this a try then.

@machine424 machine424 merged commit c528293 into prometheus:main Jun 9, 2025
27 checks passed