Bug: Cohorts stuck in-progress · Issue #32745 · PostHog/posthog · GitHub

Bug: Cohorts stuck in-progress #32745


Open
luke-belton opened this issue May 28, 2025 · 10 comments
Assignees
Labels
bug Something isn't working right feature/cohorts Feature Tag: Cohorts team/feature-flags

Comments

@luke-belton
Contributor
luke-belton commented May 28, 2025

Bug description

Occasionally we're finding cohorts that are 'stuck' in a permanent in progress calculation state, with stale results from whenever they last recalculated.

Anything using the stale cohort calculation will also be using outdated results.

A recalculation of the cohort can be forced by clicking the Save button on the cohort page in the PostHog UI, but this does not always resolve the issue.

Additional context

So far we are aware of the following failure modes resulting in stuck cohorts:

  1. Too many in_cohort filters, resulting in a memory limit exceeded error from ClickHouse (CH) (Issue: bug(cohorts): Cohort filters such as in_cohort and its negation can end up breaking our query engine #32449)

  2. Too many filters of any kind, resulting in a memory limit exceeded error from CH

  3. Celery worker doesn't fail gracefully or retry

    • This one still needs more investigation as the only symptoms of it are is_calculating=true and one of the following errors in CH:
      • Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: [::ffff:10.31.214.147]:40380, local: [::ffff:10.0.164.69]:9440). (NETWORK_ERROR) (version 24.8.14.39 (official build))
      • Code: 394. DB::Exception: Query was cancelled or a client has unexpectedly dropped the connection. (QUERY_WAS_CANCELLED) (version 24.8.14.39 (official build))
    • The is_calculating field in the cohorts table was never set back to false
    • Retries do seem to help this failure mode
    • Conversation with support eng where they ran into this and how to diagnose https://posthog.slack.com/archives/C07Q2U4BH4L/p1749241659854979
    • TODO: find out why the connection is being closed by the worker
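
For failure mode 3, one possible shape of a fix is sketched below: clear the flag in a `finally` block and retry a bounded number of times on connection errors. This is a minimal illustration, not PostHog's actual task code. `Cohort`, `recalculate`, `run_query` and `max_attempts` are hypothetical names, and `ConnectionError` stands in for the NETWORK_ERROR / QUERY_WAS_CANCELLED errors listed above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cohort:
    # Hypothetical in-memory stand-in for a row in the cohorts table.
    id: int
    is_calculating: bool = False
    errors_calculating: int = 0


def recalculate(
    cohort: Cohort,
    run_query: Callable[[Cohort], None],
    max_attempts: int = 3,
) -> bool:
    """Run the calculation with bounded retries; always clear the flag."""
    cohort.is_calculating = True
    try:
        for _ in range(max_attempts):
            try:
                run_query(cohort)
                return True
            except ConnectionError:
                # Assumed stand-in for a dropped ClickHouse connection.
                cohort.errors_calculating += 1
        return False
    finally:
        # Runs on success, on exhausted retries, and on unexpected exceptions.
        cohort.is_calculating = False
```

Note that a `finally` block only helps when the exception reaches Python. If the worker process itself is killed, the flag would still be left set, so a periodic sweep for stale `is_calculating` rows would still be needed.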

Related tickets:

Debug info

Kind: support

Target area: cohorts

Report event: http://go/ticketByUUID/4722336e-1568-437f-80cf-464be7dc742d

Session: https://us.posthog.com/project/sTMFPsFhdP1Ssg/replay/019712ab-bdc4-7997-9ad5-4d8e79fbaa67?t=3601

Exceptions: https://us.posthog.com/project/2/error_tracking?filterGroup=%7B%22type%22%3A%22AND%22%2C%22values%22%3A%5B%7B%22type%22%3A%22AND%22%2C%22values%22%3A%5B%7B%22key%22%3A%22%24session_id%22%2C%22value%22%3A%5B%22019712ab-bdc4-7997-9ad5-4d8e79fbaa67%22%5D%2C%22operator%22%3A%22exact%22%2C%22type%22%3A%22event%22%7D%5D%7D%5D%7D

Location: https://us.posthog.com/project/85440

Persons-on-events mode for project: person_id_override_properties_on_events
@luke-belton luke-belton added bug Something isn't working right feature/cohorts Feature Tag: Cohorts labels May 28, 2025
@luke-belton
Contributor Author

+1 #30995

@luke-belton
Contributor Author

+1 #31177

@andyzzhao andyzzhao self-assigned this Jun 2, 2025
@benHPostHog

@kmt901
kmt901 commented Jun 5, 2025

@andyzzhao
Contributor

This seems to be quite common, judging by the number of cohorts per team whose last calculation was more than 24 hours ago:

SELECT 
    -- team_id,
    COUNT(1) as cohort_count
FROM posthog_cohort 
WHERE last_calculation < NOW() - INTERVAL '24 hours'
AND is_static = false
GROUP BY team_id
ORDER BY cohort_count DESC;

3,745
2,096
979
886
693
632
...

@andyzzhao andyzzhao moved this to Todo in Feature Flags Jun 6, 2025
@andyzzhao andyzzhao moved this from Todo to In Progress in Feature Flags Jun 6, 2025
@andyzzhao
Contributor

Looks like we got our first customer with a nested cohort that is seeing stale data: https://posthoghelp.zendesk.com/agent/tickets/32113

@benHPostHog

+1 https://posthoghelp.zendesk.com/agent/tickets/32091 - saving again resolved the issue

@andyzzhao
Contributor

The fix to restart stuck cohort calculations is rolled out, but it is going to take a while to work through the backlog. Maybe we can increase the number of workers and parallel executions, but I'll need to get some metrics to monitor and probably talk to team ClickHouse.

In US, we had 27k stuck yesterday and today we have 24k.

SELECT COUNT(1)
FROM posthog_cohort
WHERE deleted = false
AND is_static = false
AND is_calculating = true
AND last_calculation < NOW() - INTERVAL '24 hours'
AND errors_calculating <= 20;

24,620
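
For anyone who wants to apply the same "stuck" definition outside the database (for example in a script or a test), the query's conditions can be written as a predicate. This is only a sketch: `CohortRow` and `is_stuck` are hypothetical names, the fields mirror the columns used in the query, and the `None` handling assumes `last_calculation` can be NULL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CohortRow:
    # Field names mirror the columns referenced in the query above.
    deleted: bool
    is_static: bool
    is_calculating: bool
    last_calculation: Optional[datetime]
    errors_calculating: int


def is_stuck(
    row: CohortRow,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    max_errors: int = 20,
) -> bool:
    """Return True when the row matches every condition of the query."""
    if row.last_calculation is None:
        # In SQL, NULL < timestamp is not true, so such rows are not counted.
        return False
    return (
        not row.deleted
        and not row.is_static
        and row.is_calculating
        and row.last_calculation < now - max_age
        and row.errors_calculating <= max_errors
    )


# Usage: a cohort flagged as calculating whose last result is two days old.
now = datetime(2025, 6, 7, tzinfo=timezone.utc)
row = CohortRow(False, False, True, now - timedelta(days=2), 0)
print(is_stuck(row, now))
```

Because the check requires `is_calculating` to be true, it is narrower than the earlier per-team query, which counts every non-static cohort last calculated more than 24 hours ago.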

@andyzzhao
Contributor

I reverted the restart stuck cohorts CR. See this thread for more details: https://posthog.slack.com/archives/C076R4753Q8/p1749310677162049

@benHPostHog

Projects
Status: In Progress
Development

No branches or pull requests

5 participants