Closed
Description
How to reproduce it?
- Edit sample_limit from 10000 to 100001, or
- Edit scrape_timeout from 30s to 1m
Then trigger a reload. About 30 of 3000 instances are reported as down, with lastError="Get http://10.69.140.12:9100/metrics: context canceled".
What did you expect to see?
A reload cancels the scrapeLoop instances it affects, and Prometheus then marks the targets of those canceled scrapeLoop instances as down.
We currently cannot distinguish whether instances were really down or were just misreported because of the reload.
Possible solution 1: set the up value to NaN if the context was canceled.
Possible solution 2: do not record up at all if the context was canceled.
Environment
- Prometheus version:
prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f)
build user: root@7ea0ae865f12
build date: 20200213-23:50:02
go version: go1.13.8
- Prometheus configuration file:
  - job_name: node_exporter
    honor_labels: true
    honor_timestamps: true
    scrape_interval: 1m
    scrape_timeout: 33s
    metrics_path: /metrics
    scheme: http
    sample_limit: 100001