8000 Add RFC for query migrations by thmsobrmlr · Pull Request #293 · PostHog/meta · GitHub
Status: Open · wants to merge 3 commits into base `main`
File: `requests-for-comments/2025-02-25-query-migrations.md` (80 additions, 0 deletions)
# Request for comments: Query Migrations

## Problem statement

We'd like to evolve the [query schema](https://github.com/PostHog/posthog/blob/master/frontend/src/queries/schema/index.ts) while keeping backwards compatibility for existing queries.

## Context

Queries are persisted in various places:

1. The `query` field of the Insights model.
2. Local storage for the "query draft" feature.
3. The URL for shareable insight links.
4. Notebook nodes (and canvas element).
5. The activity log, when changing a query.
6. Queries that are converted from `filters`, e.g. when users create an insight by directly calling the API.
7. ...? _(let me know)_

Some of these places are available frontend-side (e.g. queries in a URL) and some are available backend-side (e.g. queries in an insight model).

This makes it difficult to migrate a query with a non-backwards-compatible change, e.g. converting a boolean to an enum. Backwards-compatible changes, e.g. adding an optional field, are no problem.

Any error in a query causes Pydantic validation to fail, and thus insights containing the error will fail to compute.
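As a concrete illustration of the bool-to-enum case mentioned above (field and enum names here are hypothetical, not the actual PostHog query schema), a migration for such a change could look like:

```python
from enum import Enum


# Hypothetical example of a non-backwards-compatible schema change:
# a boolean `smoothing` flag becomes a three-valued enum.
class SmoothingMode(str, Enum):
    NONE = "none"
    WEEKLY = "weekly"
    MONTHLY = "monthly"


def migrate_smoothing(query: dict) -> dict:
    """Rewrite the old boolean field to the new enum field, without mutating the input."""
    migrated = dict(query)
    if isinstance(migrated.get("smoothing"), bool):
        # Old payloads used `smoothing: true/false`; map them onto the enum.
        migrated["smoothing"] = (
            SmoothingMode.WEEKLY.value if migrated["smoothing"] else SmoothingMode.NONE.value
        )
    return migrated


old_query = {"kind": "TrendsQuery", "smoothing": True}
new_query = migrate_smoothing(old_query)
# → {"kind": "TrendsQuery", "smoothing": "weekly"}
```

Without such a migration, a strict validator sees `true` where an enum member is expected and rejects the whole query.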

## Design

Migrating the schema from one version to the next should be as easy as a database migration for developers.

### Option a) Add a backend side migration and allow the frontend to use this in a new endpoint
> **Member:** I've also experienced a bunch of issues from having conversions on both the frontend and backend, so I think this makes sense – we'd need to make sure this endpoint is very fast, though.

> **Contributor (author):** 💯 I'm thinking it's alright if we have a quick way to return queries that don't need to be touched. We'd then keep backend-side queries up to date with a backfill job, and important frontend-side queries should be touched (and thereby upgraded) often as well.


We could add only a backend-side migration and let the frontend call a `/query/validation` endpoint that also migrates queries to newer versions.
> **Member:** Since queries are always retrieved from the backend, in most cases we can do something sneakier that avoids the round-trip: in all of the call sites from the "Queries are persisted in various places" list, we'd do the `/query/validation` transformation internally before the query ever gets to the user. This would mean the frontend never sees old-style queries.
>
> With one exception: filters in the URL, where the round-trip would still be necessary and would be quite painful (e.g. in some cases the frontend would not be able to support the query anymore, so it would have to wait for the validation round-trip – a few hundred milliseconds of waiting doesn't sound so bad, but it can definitely add up to a slow experience!)

> **Contributor (author):** Yup, we definitely want to handle this backend-side as much as possible. I'll probably find out how easy that is for notebooks, activity logs, etc. when working on it 😆
>
> We also have queries stored in local storage, so those need to do the round-trip as well.

> **Commenter:** Why would migrating the query take a few hundred ms? Correct me if I'm wrong, but `/query/validation` should just parse ⇒ apply migrations ⇒ serialize; it shouldn't take more than 50–100 ms. It would also be easy to cache those results.

> **Member:** Validation itself should be quick, but there's also significant overhead to every individual API request, plus internet speeds vary – piece by piece these things add up to a slow app experience, translating to lower retention → less 💵.
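The flow discussed above (parse ⇒ apply migrations ⇒ serialize, with a cheap fast path for queries that are already current) could be sketched roughly as follows. The endpoint name matches the `/query/validation` proposal, but the version numbers, registry shape, and field names are all assumptions for illustration:

```python
# Hypothetical latest schema version per query kind.
LATEST_VERSION = {"TrendsQuery": 3, "FunnelsQuery": 2}


def _trends_v1_to_v2(q: dict) -> dict:
    # Hypothetical bool-to-enum change: `smooth: bool` becomes `smoothing: str`.
    out = {k: v for k, v in q.items() if k != "smooth"}
    out["smoothing"] = "weekly" if q.get("smooth") else "none"
    out["version"] = 2
    return out


# Hypothetical registry: (kind, version migrated FROM) -> migration function.
MIGRATIONS = {
    ("TrendsQuery", 1): _trends_v1_to_v2,
    ("TrendsQuery", 2): lambda q: {**q, "version": 3},
}


def migrate_query(query: dict) -> dict:
    """What a /query/validation handler could do after parsing the request body."""
    kind = query["kind"]
    version = query.get("version", 1)
    if version >= LATEST_VERSION.get(kind, 1):
        return query  # fast path: query is already current, return immediately
    while version < LATEST_VERSION[kind]:
        query = MIGRATIONS[(kind, version)](query)
        version = query["version"]
    return query
```

The fast path is what keeps the common case cheap: an up-to-date query is answered without running any migration code.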


#### Pros

- Only one place where we'd need to add a new migration
- Probably the easiest one to maintain

#### Cons

- The frontend needs to do an additional round-trip whenever a new query is encountered
- Might be hard to implement frontend-side due to the async nature, e.g. when migrating notebook nodes

### Option b) Add a backend side and frontend side migration

#### Pros

- This is how we are currently doing it / trying to do it
- No round-trip to the backend necessary for new queries

#### Cons

- We'd need to maintain both a frontend-side and a backend-side migration, and it'll be hard to keep them in sync
- We don't get immediate feedback from Pydantic validation on the validity of queries that are migrated frontend-side

### Option c) Allow only backward compatible changes

#### Pros

- Easiest to implement
- Easiest to maintain

#### Cons

- We can only add optional fields
- Technical debt will increase

I'm favoring option a), where we add a backend-side migration and call it from the frontend, as it seems to be the easiest (and as such, most scalable) solution for future developers adding query migrations.

### Option d) Something easy I'm not seeing right now?
> **Member:** Have we considered something like a library/package with the conversion code compiled to both Python and JS? Not sure if something like this exists, but it would help reduce code duplication and things getting out of sync.

> **Contributor (author):** True, we could transpile code from one language to the other, or run the migrations inside a VM. I worry that this would be brittle and would blow up bundle size (if using Python as the source of truth), however.


- Let me know.

## Implementation

- Get in touch with customers to deprecate `filters` based insights (already happening). Remove all remaining `filters` conversions.
> **Member:** Big 👍 on getting filters removed ASAP

- Implement the above.

To be discussed:
- Should we add a version field to the queries, so that we know which migrations to run?
> **Commenter:** We may start to version our API and put the version in each request. This would allow us to discover all situations where someone is calling our trunk API with an old app version, for whatever reason.

- Should we keep migrations forever, or should we have a system in place that discards frontend-side queries older than a certain age?
> **Member:** When you talk about a "migration" here, what does that look like in code? Are we gonna have a history of the model changes over time?

> **Contributor (author):** Yes, that's what I was thinking. More concretely, I'm thinking of adding a folder with query migrations like:
>
>     posthog/schema_migrations/
>         0001_trends_query_bool_to_enum.py
>         0002_insight_query_date_range_refactor.py
>         ...
>
> Each file would have a class/methods to determine which type of query it runs on and how to modify a query from one version to the next. I'm thinking we want separate versions for each query type, so that not all queries have to be "migrated" when just changing one kind.
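A minimal sketch of what one of those per-file migration classes could look like, in the spirit of Django's numbered migration files. The base class, attribute names, and the example field (`yAxisScale`) are all assumptions, not a settled API:

```python
# Hypothetical contents of posthog/schema_migrations/0001_trends_query_bool_to_enum.py


class SchemaMigration:
    """Base class: declares which query kind(s) this migration applies to,
    and which version it upgrades FROM."""

    targets: dict = {}  # kind -> version this migration upgrades from

    def should_run(self, query: dict) -> bool:
        return self.targets.get(query.get("kind")) == query.get("version", 1)

    def transform(self, query: dict) -> dict:
        raise NotImplementedError

    def __call__(self, query: dict) -> dict:
        if not self.should_run(query):
            return query  # wrong kind or wrong version: leave untouched
        migrated = self.transform(dict(query))
        migrated["version"] = query.get("version", 1) + 1
        return migrated


class Migration(SchemaMigration):
    """0001: hypothetical TrendsQuery change, `yAxisScale: bool` -> enum string."""

    targets = {"TrendsQuery": 1}

    def transform(self, query: dict) -> dict:
        if isinstance(query.get("yAxisScale"), bool):
            query["yAxisScale"] = "log" if query["yAxisScale"] else "linear"
        return query
```

Per-kind versions in `targets` give the runner a cheap way to skip migrations that don't apply, which matches the idea that changing one query kind shouldn't force every other kind through the pipeline.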
