8000 Add support for materialized CTEs by kryonix · Pull Request #7126 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add support for materialized CTEs #7126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 83 commits into from
Jun 20, 2023
Merged

Add support for materialized CTEs #7126

merged 83 commits into from
Jun 20, 2023

Conversation

kryonix
Copy link
Contributor
@kryonix kryonix commented Apr 18, 2023

This pr adds support for materialized CTE expressions as an alternative to inlining, as I briefly mentioned in #404.

Inlining CTEs can be great, but it can also multiply the work. Take this query for example:

WITH t(x) AS
  (SELECT t FROM generate_series(1,10) AS _(t))
SELECT t1.x, t2.x FROM t AS t1, t AS t2;

Inlining duplicates the definition of t for each reference.
This results in the following query:

SELECT t1.x, t2.x FROM (SELECT t FROM generate_series(1,10) AS _(t)) AS t1(x),
                       (SELECT t FROM generate_series(1,10) AS _(t)) AS t2(x);

Now, if t is an expensive query, there may be a problem.


With this pr, CTEs can be annotated as being MATERIALIZED.
Such CTEs first fully materialize their result, which can then be read:

WITH t(x) AS MATERIALIZED
  (SELECT t FROM generate_series(1,10) AS _(t))
SELECT t1.x, t2.x FROM t AS t1, t AS t2;

Instead of copying t for each reference, t is calculated exactly once. Since the result is cached, no work is done multiple times.

@kryonix
Copy link
Contributor Author
kryonix commented Apr 18, 2023

This pr is currently a draft, as there are two remaining issues:


First, something during binding seems to be broken, as the following fails:

SELECT x FROM (WITH t(y, z) AS MATERIALIZED (SELECT 1, 2) SELECT * FROM t) AS _(x, y);
Error: INTERNAL Error: Failed to bind column reference "x" [0.0] (bindings: [9.0])

Edit: This is fixed.


Second, pipeline construction of two (or more) materialized recursive CTEs fails:

WITH RECURSIVE t(x) AS MATERIALIZED
(
  SELECT 1
    UNION ALL
  SELECT x+1
  FROM   t
  WHERE  x < 4
)
,
u(x) AS
(
  SELECT *
  FROM   t
    UNION ALL
  SELECT u.x + t.x
  FROM   u, t
  WHERE  u.x < 32
)
SELECT *
FROM   u;
Error: INTERNAL Error: Assertion triggered in file "/duckdb/src/parallel/executor.cpp" on line 156: event_map_entry != event_map.end()

Edit: This is fixed.

Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - very cool feature! I have left some comments inline. One note is that we have been working to remove C-style pointers (see #7080) - this might cause a few merge conflicts on your end.

SELECT x FROM (WITH t(y, z) AS MATERIALIZED (SELECT 1, 2) SELECT * FROM t) AS _(x, y);

Error: INTERNAL Error: Failed to bind column reference "x" [0.0] (bindings: [9.0])

This is the column binding resolver that complains. Essentially "x" is bound as the first column to table binding 0, but table binding 0 is not accessible from that part of the plan.

Second, pipeline construction of two (or more) materialized recursive CTEs fails:

I'm not sure about what causes this exactly - I would guess that something goes wrong in the order in which BuildPipelines is executed so that the layered situation is not resolved correctly, or perhaps the incorrect pipeline is passed on to the child.

Perhaps it would also be interesting to add a benchmark where the benefits of the materialized CTEs are visible?

@kryonix
Copy link
Contributor Author
kryonix commented Apr 21, 2023

The binder issue seems to be resolved now.

Thank you for the comments. I will address them.

Thanks for the PR! Looks good - very cool feature! I have left some comments inline. One note is that we have been working to remove C-style pointers (see #7080) - this might cause a few merge conflicts on your end.

I think I will just rebase onto master and do the rewritings before I take this pr out of its draft status. Should be a clean solution.

< 8000 !-- no margin wins, so we check it last and use its value if true. -->

kryonix added 26 commits May 9, 2023 10:43
@kryonix
Copy link
Contributor Author
kryonix commented May 25, 2023

I think all remaining issues are now resolved.

@kryonix kryonix requested a review from Mytherin May 25, 2023 15:01
@kryonix
Copy link
Contributor Author
kryonix commented May 25, 2023

NVM I guess. ALTERNATIVE_VERIFY still has some issues 👀

@kryonix
Copy link
Contributor Author
kryonix commented Jun 7, 2023

If you find the time, @Mytherin, I think this pr is ready for review now.

Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! Looks great to me now. One minor comment otherwise this is good to go. Could you also merge with feature so the CI can rerun? I believe the CI failures should be fixed then.

TransformCTE(*PGPointerCast<duckdb_libpgquery::PGWithClause>(stmt.withClause), result->cte_map,
materialized_ctes);
if (!materialized_ctes.empty()) {
throw NotImplementedException("Materialized CTEs are not implemented for delete.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be update - also could you perhaps expand on the reason for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the message is wrong. I will fix that.


There is an AST 'mismatch' currently. Materialized CTEs are represented as a QueryNode, whereasINSERT/UPDATE/DELETE are SQLStatements. This means, that it is easy to represent the materialized CTEs as a bunch of stacked QueryNodes (this happens in Transformer::TransformMaterializedCTE). It is not that easy for INSERT/UPDATE/DELETE, however. While there should not be a principle problem implementing that, my current plan is to tackle that in a separate pr.

statement ok
PRAGMA enable_verification

require noalternativeverify
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary here?

Copy link
Contributor Author
@kryonix kryonix Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative verify would lead to non-MATERIALIZED CTEs here. Therefore, the tests would be unexpectedly successful.

@Mytherin Mytherin merged commit 09fd4e3 into duckdb:feature Jun 20, 2023
@Mytherin
Copy link
Collaborator

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0