Bug: Cohorts stuck in-progress #32745

luke-belton · 2025-05-28T09:01:30Z

Bug Description

Occasionally we find cohorts that are 'stuck' in a permanent in-progress calculation state, showing stale results from whenever they last recalculated.

Anything using the stale cohort calculation will also be using outdated results.

A recalculation of the cohort can be forced by clicking the Save button on the cohort page in the PostHog UI, but this does not always resolve the issue.

Additional context

So far we are aware of the following failure modes that result in stuck cohorts:

- Too many in_cohort filters resulting in memory limit exceeded error from CH (Issue: bug(cohorts): Cohort filters such as in_cohort and its negation can end up breaking our query engine #32449)
  - This has been mostly fixed with fix(cohorts): in/not in cohort calculations stuck #33186
  - TODO: calculate nested cohorts in a sorted order to minimize stale data
- Too many of any filter resulting in memory limit exceeded error from CH
  - Retries unfortunately won't help these errors and we need to optimize how we are querying CH
  - Conversation with CH team discussing query optimizations: https://posthog.slack.com/archives/C076R4753Q8/p1749239456159729
- Celery worker doesn't fail gracefully or retry
  - This one still needs more investigation as the only symptoms of it are is_calculating=true and one of the following errors in CH:
    - Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: [::ffff:10.31.214.147]:40380, local: [::ffff:10.0.164.69]:9440). (NETWORK_ERROR) (version 24.8.14.39 (official build))
    - Code: 394. DB::Exception: Query was cancelled or a client has unexpectedly dropped the connection. (QUERY_WAS_CANCELLED) (version 24.8.14.39 (official build))
  - The is_calculating field in the cohorts table was never set back to false
  - Retries do seem to help this failure mode
  - Conversation with support eng where they ran into this and how to diagnose: https://posthog.slack.com/archives/C07Q2U4BH4L/p1749241659854979
  - TODO: find out why the connection is being closed by the worker
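
To illustrate the third failure mode (is_calculating never set back to false), here is a minimal sketch of one way a calculation wrapper could guarantee the flag is cleared when the query raises. This is not the actual PostHog code: the `Cohort` dataclass, `recalculate`, and `run_query` are hypothetical names, and only the `is_calculating` and `errors_calculating` fields come from this issue.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    # Hypothetical in-memory stand-in for a row in the cohorts table.
    id: int
    is_calculating: bool = False
    errors_calculating: int = 0


def recalculate(cohort, run_query):
    """Run the calculation and clear is_calculating whether or not it fails."""
    cohort.is_calculating = True
    try:
        run_query(cohort)              # e.g. the ClickHouse query; may raise
        cohort.errors_calculating = 0  # success resets the error counter
    except Exception:
        cohort.errors_calculating += 1
        raise                          # let the caller decide whether to retry
    finally:
        # Runs on both success and failure, so the flag cannot stay true
        # after an exception such as a dropped connection.
        cohort.is_calculating = False
```

Note that a `finally` block only helps when the exception reaches Python code. If the worker process itself is killed, the flag would still be left as true, so this sketch would not cover that case.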

Related tickets:

Debug info

Kind: support

Target area: cohorts

Report event: http://go/ticketByUUID/4722336e-1568-437f-80cf-464be7dc742d

Session: https://us.posthog.com/project/sTMFPsFhdP1Ssg/replay/019712ab-bdc4-7997-9ad5-4d8e79fbaa67?t=3601

Exceptions: https://us.posthog.com/project/2/error_tracking?filterGroup=%7B%22type%22%3A%22AND%22%2C%22values%22%3A%5B%7B%22type%22%3A%22AND%22%2C%22values%22%3A%5B%7B%22key%22%3A%22%24session_id%22%2C%22value%22%3A%5B%22019712ab-bdc4-7997-9ad5-4d8e79fbaa67%22%5D%2C%22operator%22%3A%22exact%22%2C%22type%22%3A%22event%22%7D%5D%7D%5D%7D

Location: https://us.posthog.com/project/85440

Persons-on-events mode for project: person_id_override_properties_on_events

luke-belton · 2025-05-28T09:05:37Z

+1 #30995

luke-belton · 2025-05-28T09:07:11Z

+1 #31177

benHPostHog · 2025-06-04T02:05:04Z

+2 https://posthoghelp.zendesk.com/agent/tickets/31859 and https://posthoghelp.zendesk.com/agent/tickets/31812

kmt901 · 2025-06-05T12:44:39Z

+1 https://posthoghelp.zendesk.com/agent/tickets/32004

andyzzhao · 2025-06-05T16:52:57Z

This seems to be quite common, judging by the per-team counts of cohorts whose last calculation was more than 24 hours ago:

SELECT 
    -- team_id,
    COUNT(1) as cohort_count
FROM posthog_cohort 
WHERE last_calculation < NOW() - INTERVAL '24 hours'
AND is_static = false
GROUP BY team_id
ORDER BY cohort_count DESC;

3,745
2,096
979
886
693
632
...

andyzzhao · 2025-06-06T22:15:25Z

Looks like we got our first customer with a nested cohort that is seeing stale data: https://posthoghelp.zendesk.com/agent/tickets/32113
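
For context on why nested cohorts go stale: if a cohort that references another cohort is calculated before the one it references, it picks up the old membership. A minimal sketch of dependency ordering, assuming the nesting is available as a mapping from cohort id to the ids it references (the `calculation_order` function and the mapping shape are hypothetical, not the implementation in the linked PR):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def calculation_order(dependencies):
    """Return cohort ids so that every referenced cohort comes first.

    `dependencies` maps a cohort id to the set of cohort ids it references
    (hypothetical shape). Raises graphlib.CycleError if cohorts reference
    each other in a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Cohort 3 references cohort 2, which references cohort 1.
order = calculation_order({3: {2}, 2: {1}, 1: set()})
```

Here `order` lists cohort 1 before 2 and 2 before 3, so each cohort is calculated only after the cohorts it depends on.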

benHPostHog · 2025-06-06T23:47:43Z

+1 https://posthoghelp.zendesk.com/agent/tickets/32091 - saving again resolved the issue

andyzzhao · 2025-06-07T00:02:54Z

The fix to restart stuck cohort calculations is rolled out, but it's going to take a while. Maybe we can increase the number of workers and parallel executions, but I'll need to get some metrics to monitor and probably talk to team clickhouse.

In US, we had 27k stuck yesterday and today we have 24k.

SELECT count(1)
FROM posthog_cohort
where deleted = false
and is_static = false
and is_calculating = true
and last_calculation < NOW() - INTERVAL '24 hours'
AND errors_calculating <= 20;

24,620
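
The same "stuck" condition can be expressed as a small predicate, which may be handy for unit-testing the selection logic without a database. This is only a sketch: the dict shape and the `is_stuck` name are assumptions, while the field names and thresholds mirror the query above.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(cohort, now, max_errors=20, stale_after=timedelta(hours=24)):
    """Mirror the WHERE clause above for one cohort row given as a dict."""
    last = cohort["last_calculation"]
    return (
        not cohort["deleted"]
        and not cohort["is_static"]
        and cohort["is_calculating"]
        # A NULL last_calculation does not match in SQL, so None is excluded here too.
        and last is not None
        and last < now - stale_after
        and cohort["errors_calculating"] <= max_errors
    )


now = datetime(2025, 6, 7, tzinfo=timezone.utc)
example = {
    "deleted": False,
    "is_static": False,
    "is_calculating": True,
    "last_calculation": now - timedelta(days=3),
    "errors_calculating": 2,
}
stuck = is_stuck(example, now)  # True for this example row
```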

andyzzhao · 2025-06-09T18:00:17Z

I reverted the restart stuck cohorts CR. See this thread for more details: https://posthog.slack.com/archives/C076R4753Q8/p1749310677162049

benHPostHog · 2025-06-17T21:09:15Z

+1 https://posthoghelp.zendesk.com/agent/tickets/32665

luke-belton added the bug and feature/cohorts labels May 28, 2025

haacked added the team/feature-flags label Jun 2, 2025

andyzzhao self-assigned this Jun 2, 2025

andyzzhao mentioned this issue Jun 5, 2025

bug(cohorts): Cohort filters such as in_cohort and its negation can end up breaking our query engine #32449

Closed

haacked added this to Feature Flags Jun 5, 2025

andyzzhao mentioned this issue Jun 6, 2025

fix(cohorts): restart stuck cohort calculations #33279

Merged

3 tasks

andyzzhao moved this to Todo in Feature Flags Jun 6, 2025

andyzzhao moved this from Todo to In Progress in Feature Flags Jun 6, 2025

andyzzhao mentioned this issue Jun 6, 2025

fix(cohorts): calculate nested cohorts in order #33341

Merged

3 tasks
