Bug: Cohorts stuck in-progress #32745
Comments
+1 #30995
+1 #31177
Seems like this is quite common, just looking at cohorts whose last calculation is more than 24 hours old.
Looks like we got our first customer with a nested cohort that is seeing stale data: https://posthoghelp.zendesk.com/agent/tickets/32113
+1 https://posthoghelp.zendesk.com/agent/tickets/32091 - saving again resolved the issue
The fix to restart stuck cohort calculations is rolled out, but it's going to take a while. Maybe we can increase the number of workers and parallel executions, but I'll need to get some metrics to monitor and probably talk to Team ClickHouse. In the US, we had 27k stuck yesterday and today we have 24k.
I reverted the restart stuck cohorts CR. See this thread for more details: https://posthog.slack.com/archives/C076R4753Q8/p1749310677162049
Bug description
Occasionally we're finding cohorts that are 'stuck' in a permanent in progress calculation state, with stale results from whenever they last recalculated.
Anything using the stale cohort calculation will also be using outdated results.
A recalculation of the cohort can be forced by clicking the Save button on the cohort page in the PostHog UI, but this does not always resolve the issue.
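To make "stuck" concrete, here is a minimal sketch of how such cohorts could be listed, assuming a Django ORM model with the `is_calculating` and `last_calculation` fields referenced in this issue; the import path is an assumption, not the actual PostHog code.

```python
# Illustrative only: list cohorts that look "stuck" -- still flagged as
# calculating, but whose last calculation is older than a cutoff.
from datetime import timedelta

from django.utils import timezone

from posthog.models import Cohort  # assumed import path


def find_stuck_cohorts(max_age_hours: int = 24):
    """Return cohorts still marked as calculating whose results are stale."""
    cutoff = timezone.now() - timedelta(hours=max_age_hours)
    return Cohort.objects.filter(is_calculating=True, last_calculation__lt=cutoff)
```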
Additional context
So far we are aware of the following failure modes resulting in stuck cohorts:

- Too many `in_cohort` filters, resulting in a memory limit exceeded error from CH (Issue: bug(cohorts): Cohort filters such as `in_cohort` and its negation can end up breaking our query engine #32449)
- Too many of any filter, resulting in a memory limit exceeded error from CH
- Celery worker doesn't fail gracefully or retry, leaving `is_calculating=true` and one of the following errors in CH (see the sketch after this list):
  - Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: [::ffff:10.31.214.147]:40380, local: [::ffff:10.0.164.69]:9440). (NETWORK_ERROR) (version 24.8.14.39 (official build))
  - Code: 394. DB::Exception: Query was cancelled or a client has unexpectedly dropped the connection. (QUERY_WAS_CANCELLED) (version 24.8.14.39 (official build))
- `is_calculating` field in the cohorts table was never set back to false

Related tickets:
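As a rough illustration of the Celery failure mode above, here is a minimal sketch of a task that retries transient ClickHouse failures and always resets `is_calculating` in a `finally` block. This is not the actual PostHog task: the task name, placeholder helper, and import path are assumptions for illustration. It also only helps when the worker fails gracefully; if the worker is killed outright, a separate sweep that restarts stuck calculations (as discussed in the comments above) is still needed.

```python
# Hypothetical sketch, not the real PostHog implementation.
import logging

from celery import shared_task

from posthog.models import Cohort  # assumed import path

logger = logging.getLogger(__name__)


def calculate_cohort_people(cohort: Cohort) -> None:
    """Placeholder for the real ClickHouse-backed cohort calculation."""
    ...


@shared_task(bind=True, max_retries=3, retry_backoff=True)
def recalculate_cohort(self, cohort_id: int) -> None:
    cohort = Cohort.objects.get(pk=cohort_id)
    cohort.is_calculating = True
    cohort.save(update_fields=["is_calculating"])
    try:
        calculate_cohort_people(cohort)
    except Exception as err:
        # Retry transient failures (e.g. NETWORK_ERROR / QUERY_WAS_CANCELLED)
        # instead of silently giving up on the calculation.
        logger.exception("Cohort %s calculation failed", cohort_id)
        raise self.retry(exc=err)
    finally:
        # Clear the flag even on failure so the cohort is never left
        # permanently "in progress" with stale results.
        cohort.is_calculating = False
        cohort.save(update_fields=["is_calculating"])
```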
Debug info