
Bug: Query-Scheduler - dequeueSelectNode : panic: runtime error: index out of range [-1] #11337


Closed
pbailhache opened this issue Apr 29, 2025 · 3 comments · Fixed by #11510
Assignees
Labels
bug Something isn't working

Comments

@pbailhache
Contributor

What is the bug?

We're investigating a problem detected on one of our Mimir clusters.
We observed a lot of errors from the queriers:

rpc error: code = Unknown desc = querier has informed the scheduler it is shutting down

Following this, we found that the query schedulers were frequently panicking while handling dequeues:

panic: runtime error: index out of range [-1]
goroutine 31 [running]:
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*QuerierWorkerQueuePriorityAlgo).dequeueSelectNode(0x47073d?, 0x7f77ee5f65b8?)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/tree_queue_algo_querier_worker_queue_priority.go:117 +0x5d
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*Node).dequeue(0xc001d206c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/multi_algorithm_tree_queue.go:253 +0x107
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*MultiAlgorithmTreeQueue).Dequeue(0xc0008ff0a0?, 0xc002b15220?)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/multi_algorithm_tree_queue.go:94 +0x33
github.com/grafana/mimir/pkg/scheduler/queue.(*queueBroker).dequeueRequestForQuerier(0xc001cdaa50, 0xc0026198c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue_broker.go:137 +0xc3
github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).trySendNextRequestForQuerier(0xc000a38bb0, 0xc0026198c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:391 +0x3c
github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).dispatcherLoop(0xc000a38bb0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:323 +0x62d
created by github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting in goroutine 29
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:258 +0x4f

Related source code (git tag mimir-2.15.2):
https://github.com/grafana/mimir/blob/mimir-2.15.2/pkg/scheduler/queue/tree/tree_queue_algo_querier_worker_queue_priority.go#L117

    currentNodeName := qa.nodeOrder[qa.currentNodeOrderIndex]

I'm trying to understand what's happening here and why we end up dequeuing with an out-of-range index, but I'm not sure exactly where to look.

We cannot say for sure when it started due to log retention.

How to reproduce it?

It happens frequently, but we do not have a way to reproduce it manually.

What did you think would happen?

No panic and no restart of the query scheduler.

What was your environment?

Mimir 2.15.2 running on multiple instances

Any additional context to share?

Here is the configuration for both the scheduler and the querier.

Query scheduler config:

query_scheduler:
    max_outstanding_requests_per_tenant: 100
    querier_forget_delay: 0s
    grpc_client_config:
        max_recv_msg_size: 209715200
        max_send_msg_size: 419430400
        grpc_compression: snappy
        rate_limit: 0
        rate_limit_burst: 0
        backoff_on_ratelimits: false
        backoff_config:
            min_period: 100ms
            max_period: 10s
            max_retries: 10
        initial_stream_window_size: 63KiB1023B
        initial_connection_window_size: 63KiB1023B
        tls_enabled: true
        tls_cert_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_key_path: /etc/ssl/ecdsa-ssl-key-snakeoil.pem
        tls_ca_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_server_name: ""
        tls_insecure_skip_verify: true
        tls_cipher_suites: ""
        tls_min_version: VersionTLS13
        connect_timeout: 5s
        connect_backoff_base_delay: 1s
        connect_backoff_max_delay: 5s
    service_discovery_mode: dns
    ring:
        kvstore:
            store: memberlist
            prefix: mimir/query_scheduler/
            consul:
                host: localhost:8500
                acl_token: '********'
                http_client_timeout: 20s
                consistent_reads: false
                watch_rate_limit: 1
                watch_burst_size: 1
                cas_retry_delay: 1s
            etcd:
                endpoints: []
                dial_timeout: 10s
                max_retries: 10
                tls_enabled: false
                tls_cert_path: ""
                tls_key_path: ""
                tls_ca_path: ""
                tls_server_name: ""
                tls_insecure_skip_verify: false
                tls_cipher_suites: ""
                tls_min_version: ""
                username: ""
                password: ""
            multi:
                primary: ""
                secondary: ""
                mirror_enabled: false
                mirror_timeout: 2s
        heartbeat_period: 15s
        heartbeat_timeout: 1m0s
        instance_id: *******
        instance_interface_names:
            - eth1
        instance_port: 0
        instance_addr: ""
        instance_enable_ipv6: false
    max_used_instances: 0

Querier config:

querier:
    query_store_after: 23h45m0s
    store_gateway_client:
        tls_enabled: true
        tls_cert_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_key_path: /etc/ssl/ecdsa-ssl-key-snakeoil.pem
        tls_ca_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_server_name: ""
        tls_insecure_skip_verify: true
        tls_cipher_suites: ""
        tls_min_version: VersionTLS13
    shuffle_sharding_ingesters_enabled: true
    prefer_availability_zone: ""
    streaming_chunks_per_ingester_series_buffer_size: 256
    streaming_chunks_per_store_gateway_series_buffer_size: 256
    minimize_ingester_requests: true
    minimize_ingester_requests_hedging_delay: 3s
    query_engine: mimir
    enable_query_engine_fallback: true
    max_concurrent: 8
    timeout: 5m0s
    max_samples: 50000000
    default_evaluation_interval: 1m0s
    lookback_delta: 5m0s
    promql_experimental_functions_enabled: false

I can provide more logs if needed, as the problem happens frequently.

@pbailhache pbailhache added the bug Something isn't working label Apr 29, 2025
@chencs
Contributor
chencs commented Apr 30, 2025

Hi, I'm taking a look at trying to reproduce this! Just to clarify, are you seeing cortex_query_scheduler_connected_frontend_clients drop just before the scheduler panics?

@francoposa
Contributor

Hi @pbailhache, thank you for the submission.

We believe we have found the source of this and a fix, though we need to write a test to repro for confirmation.

When a querier-worker is deregistered from the scheduler (due to its goroutine crashing, the querier erroring or OOMing, or a normal querier shutdown procedure), its querier-worker ID is set to -1 to reflect this.

However, in some cases there may still be dequeue requests from that deregistered querier-worker sitting in an internal queue in the scheduler. When we attempt to service a dequeue request with a querier-worker ID of -1, we get this panic.
We should instead drop those dequeue requests before attempting to service them, as the querier-worker connection is no longer present to receive a dequeued request anyway.

@pbailhache
Contributor Author

I think that's it, yes.

I looked a bit more and it seems we have a problem on our queriers causing a lot of connection drops. Here is a graph covering one hour on one of our schedulers:

[graph: scheduler querier connection drops over one hour]

Here is a zoomed-in part; the panic happens at 20:19:52 here.

[graph: zoomed view of scheduler querier connection drops around the panic]

It's the same across all of our schedulers on this cluster. I'm digging into what's happening on the querier side, but I think you were right about the large number of connection drops triggering the panic.

Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants