Bug: Query-Scheduler - dequeueSelectNode : panic: runtime error: index out of range [-1] · Issue #11337 · grafana/mimir · GitHub
Hi, I'm taking a look at trying to reproduce this! Just to clarify, are you seeing cortex_query_scheduler_connected_frontend_clients drop just before the scheduler panics?
We believe we have found the source of this and a fix, though we need to write a test to repro for confirmation.
When a querier-worker is deregistered from the scheduler (due to its goroutine crashing, the querier OOMing, or some normal querier shutdown procedure), its querier-worker ID is set to -1 to reflect this.
However, in some cases, there may still be dequeue requests from that deregistered querier-worker sitting in an internal queue in the scheduler. When we attempt to service a dequeue request with the querier-worker ID of -1, we get this panic.
We should instead drop those dequeue requests before attempting to service them, as the querier-worker connection is no longer present to receive a dequeue anyway.
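To illustrate the idea, here is a minimal sketch of dropping stale dequeue requests before they reach node selection. All names here (`dequeueRequest`, `workerID`, `dropDeregistered`, `selectNode`) are hypothetical and are not Mimir's actual types or functions, and the modulo-based selection is an assumption about how a worker ID might be turned into an index. What is certain is that in Go `-1 % n` evaluates to `-1` for positive `n`, so a worker ID of -1 used this way without a guard would produce an index of -1.

```go
package main

// deregisteredWorkerID is the sentinel described above: a querier-worker
// whose connection is gone has its ID set to -1.
const deregisteredWorkerID = -1

// dequeueRequest is a hypothetical stand-in for a pending dequeue request;
// it is not Mimir's real type.
type dequeueRequest struct {
	querierID string
	workerID  int
}

// dropDeregistered returns only the requests whose querier-worker is still
// registered. Requests carrying the -1 sentinel are discarded, since there
// is no connection left to receive the dequeued item.
func dropDeregistered(pending []dequeueRequest) []dequeueRequest {
	live := make([]dequeueRequest, 0, len(pending))
	for _, req := range pending {
		if req.workerID == deregisteredWorkerID {
			continue
		}
		live = append(live, req)
	}
	return live
}

// selectNode picks a node for the request by worker ID (assumed scheme).
// The guard makes the function refuse negative IDs instead of indexing
// the slice with -1, which would panic.
func selectNode(nodes []string, req dequeueRequest) (string, bool) {
	if req.workerID < 0 || len(nodes) == 0 {
		return "", false
	}
	return nodes[req.workerID%len(nodes)], true
}
```

Filtering the pending requests first and keeping the guard in `selectNode` are two layers of the same defense: the filter avoids wasted work, and the guard keeps a missed stale request from crashing the process.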
I looked a bit more and it seems we have a problem on our queriers causing a lot of connection drops. Here is a graph covering one hour on one of our schedulers:
Here is a zoomed-in view; the panic happens at 20:19:52.
It's the same across all of our schedulers on this cluster. I'm digging into what's happening on the querier side, but I think you were right about the large number of connection drops triggering the panic.
What is the bug?
We're investigating a problem detected on one of our Mimir clusters.
We observed a lot of errors from the queriers:
Following this, we found that the query schedulers were frequently panicking when managing dequeueing:
Related source code (git tag mimir-2.15.2):
https://github.com/grafana/mimir/blob/mimir-2.15.2/pkg/scheduler/queue/tree/tree_queue_algo_querier_worker_queue_priority.go#L117
I'm trying to understand what's happening here and why we try to dequeue with an out-of-range index, but I'm not sure exactly where to look.
We cannot say for sure when it started due to log retention.
How to reproduce it?
It happens frequently, but we do not have a way to reproduce it manually.
What did you think would happen?
No panic and no restart of the query scheduler.
What was your environment?
Mimir 2.15.2 running on multiple instances
Any additional context to share?
Here is the configuration for both the scheduler and the querier.
Query scheduler config:
Querier config:
I can provide more logs if needed, as the problem happens frequently.