
Bug: Query-Scheduler - dequeueSelectNode : panic: runtime error: index out of range [-1] #11337


Closed
pbailhache opened this issue Apr 29, 2025 · 3 comments · Fixed by #11510
Assignees
Labels
bug Something isn't working

Comments

@pbailhache
Contributor

What is the bug?

We're investigating a problem detected on one of our Mimir clusters.
We observed a lot of errors from the queriers:

rpc error: code = Unknown desc = querier has informed the scheduler it is shutting down

Following this, we found that the query schedulers were frequently panicking while handling dequeues:

panic: runtime error: index out of range [-1]
goroutine 31 [running]:
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*QuerierWorkerQueuePriorityAlgo).dequeueSelectNode(0x47073d?, 0x7f77ee5f65b8?)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/tree_queue_algo_querier_worker_queue_priority.go:117 +0x5d
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*Node).dequeue(0xc001d206c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/multi_algorithm_tree_queue.go:253 +0x107
github.com/grafana/mimir/pkg/scheduler/queue/tree.(*MultiAlgorithmTreeQueue).Dequeue(0xc0008ff0a0?, 0xc002b15220?)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/tree/multi_algorithm_tree_queue.go:94 +0x33
github.com/grafana/mimir/pkg/scheduler/queue.(*queueBroker).dequeueRequestForQuerier(0xc001cdaa50, 0xc0026198c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue_broker.go:137 +0xc3
github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).trySendNextRequestForQuerier(0xc000a38bb0, 0xc0026198c0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:391 +0x3c
github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).dispatcherLoop(0xc000a38bb0)
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:323 +0x62d
created by github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting in goroutine 29
	/go/src/github.com/grafana/mimir/pkg/scheduler/queue/queue.go:258 +0x4f

Related source code (git tag mimir-2.15.2):
https://github.com/grafana/mimir/blob/mimir-2.15.2/pkg/scheduler/queue/tree/tree_queue_algo_querier_worker_queue_priority.go#L117

    currentNodeName := qa.nodeOrder[qa.currentNodeOrderIndex]

I'm trying to understand what's happening here and why we end up dequeuing with an out-of-range index, but I'm not sure exactly where to look.

We cannot say for sure when it started due to log retention.

How to reproduce it?

It happens frequently, but we do not have a way to reproduce it manually.

What did you think would happen?

No panic and no restart of the query scheduler.

What was your environment?

Mimir 2.15.2 running on multiple instances

Any additional context to share?

Here is the configuration for both the scheduler and the querier.

Query scheduler config:

query_scheduler:
    max_outstanding_requests_per_tenant: 100
    querier_forget_delay: 0s
    grpc_client_config:
        max_recv_msg_size: 209715200
        max_send_msg_size: 419430400
        grpc_compression: snappy
        rate_limit: 0
        rate_limit_burst: 0
        backoff_on_ratelimits: false
        backoff_config:
            min_period: 100ms
            max_period: 10s
            max_retries: 10
        initial_stream_window_size: 63KiB1023B
        initial_connection_window_size: 63KiB1023B
        tls_enabled: true
        tls_cert_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_key_path: /etc/ssl/ecdsa-ssl-key-snakeoil.pem
        tls_ca_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_server_name: ""
        tls_insecure_skip_verify: true
        tls_cipher_suites: ""
        tls_min_version: VersionTLS13
        connect_timeout: 5s
        connect_backoff_base_delay: 1s
        connect_backoff_max_delay: 5s
    service_discovery_mode: dns
    ring:
        kvstore:
            store: memberlist
            prefix: mimir/query_scheduler/
            consul:
                host: localhost:8500
                acl_token: '********'
                http_client_timeout: 20s
                consistent_reads: false
                watch_rate_limit: 1
                watch_burst_size: 1
                cas_retry_delay: 1s
            etcd:
                endpoints: []
                dial_timeout: 10s
                max_retries: 10
                tls_enabled: false
                tls_cert_path: ""
                tls_key_path: ""
                tls_ca_path: ""
                tls_server_name: ""
                tls_insecure_skip_verify: false
                tls_cipher_suites: ""
                tls_min_version: ""
                username: ""
                password: ""
            multi:
                primary: ""
                secondary: ""
                mirror_enabled: false
                mirror_timeout: 2s
        heartbeat_period: 15s
        heartbeat_timeout: 1m0s
        instance_id: *******
        instance_interface_names:
            - eth1
        instance_port: 0
        instance_addr: ""
        instance_enable_ipv6: false
    max_used_instances: 0

Querier config:

querier:
    query_store_after: 23h45m0s
    store_gateway_client:
        tls_enabled: true
        tls_cert_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_key_path: /etc/ssl/ecdsa-ssl-key-snakeoil.pem
        tls_ca_path: /etc/ssl/ecdsa-ssl-cert-snakeoil.pem
        tls_server_name: ""
        tls_insecure_skip_verify: true
        tls_cipher_suites: ""
        tls_min_version: VersionTLS13
    shuffle_sharding_ingesters_enabled: true
    prefer_availability_zone: ""
    streaming_chunks_per_ingester_series_buffer_size: 256
    streaming_chunks_per_store_gateway_series_buffer_size: 256
    minimize_ingester_requests: true
    minimize_ingester_requests_hedging_delay: 3s
    query_engine: mimir
    enable_query_engine_fallback: true
    max_concurrent: 8
    timeout: 5m0s
    max_samples: 50000000
    default_evaluation_interval: 1m0s
    lookback_delta: 5m0s
    promql_experimental_functions_enabled: false

I can provide more logs if needed, as the problem happens frequently.

@pbailhache pbailhache added the bug Something isn't working label Apr 29, 2025
@chencs
Contributor
chencs commented Apr 30, 2025

Hi, I'm taking a look at trying to reproduce this! Just to clarify, are you seeing cortex_query_scheduler_connected_frontend_clients drop just before the scheduler panics?

@francoposa
Contributor

Hi @pbailhache, thank you for the submission.

We believe we have found the source of this and a fix, though we need to write a test to repro for confirmation.

When a querier-worker is deregistered from the scheduler (due to its goroutine crashing, the querier erroring or OOMing, or a normal querier shutdown procedure), its querier-worker ID is set to -1 to reflect this.

However, in some cases there may still be dequeue requests from that deregistered querier-worker sitting in an internal queue in the scheduler. When we attempt to service a dequeue request with a querier-worker ID of -1, we get this panic.
We should instead drop those dequeue requests before attempting to service them, as the querier-worker connection is no longer present to receive a dequeued request anyway.

@pbailhache
Contributor Author

I think that's it, yes.

I looked a bit more and it seems we have a problem on our queriers causing a lot of connection drops. Here is a graph covering one hour on one of our schedulers:

[graph: scheduler querier connection drops over one hour]

Here is a zoomed-in part; the panic happens at 20:19:52 here.

[graph: zoomed view of scheduler querier connection drops around the panic]

It's the same across all of our schedulers on this cluster. I'm digging into what's happening on the querier side, but I think you were right about the large number of connection drops triggering the panic.

Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants