Add new recursive semantics (`USING KEY`) #12430
Conversation
Commit "…ix ToString of query node": speed up from 5 times slower than normal to 2 times slower.
Force-pushed from 57404ca to 88c6564.
Thanks for the PR! Looks great! Nice results. Some comments below:
```cpp
explicit RecursiveKeyCTEState(ClientContext &context, const PhysicalRecursiveKeyCTE &op)
    : intermediate_table(context, op.GetTypes()), new_groups(STANDARD_VECTOR_SIZE) {

vector<BoundAggregateExpression *> payload_aggregates_ptr;
```
nit: `vector<reference<BoundAggregateExpression>>`
I think that would be a good idea, but `AggregateHashTable` needs a vector of type `vector<BoundAggregateExpression *>`. Therefore I've used the raw pointers as well. I could adapt the `AggregateHashTable` to use a `vector<reference<...>>`, but I'm not sure if that was your intention?
```cpp
// btodo: Maybe there is another way, we handle unique pointers and therefore can only copy the keys.
for (auto &key : info.key_targets) {
	result.key_targets.emplace_back(key->Copy());
```
Can we move here instead of copying? (`std::move`)
As my todo comment above this line also mentions, I am not happy with this solution and hope to find a better way. But during the preparation of the statement, we copy the `cte_map`, and when we move the keys, we get an error for dereferencing a NULL pointer.
```cpp
void QueryNode::CopyProperties(QueryNode &other) const {
	...
	for (auto &kv : cte_map.map) {
		...
		for (auto &key : kv.second->key_targets) {
			kv_info->key_targets.push_back(key->Copy());
		}
		...
```
query II
Can we perhaps include a larger benchmark comparing the standard recursive CTEs vs the new ones as part of the benchmark runner (see e.g. `benchmark/micro/cast/cast_date_string.benchmark`)?
I will write more tests when the syntax is determined. 👍
```cpp
// and we clear the intermediate table
gstate.intermediate_table.Reset();
// now we need to re-execute all of the pipelines that depend on the recursion
ExecuteRecursivePipelines(context);
```
Completely unrelated to this PR - but given you mentioned recursive CTE performance - I've profiled recursive CTEs before and in many cases (almost) all of the time went into the breaking down/setting up of pipeline states. If you're interested in optimizing recursive CTEs, reusing these states across iterations would likely lead to significant performance improvements as well.
Oh, that sounds interesting. I will look into it after this pull request.
Force-pushed from 13c03df to c9978c5.
The PR should be fine; the only check that seems to fail is an HTTPS error from vcpkg. I did write something to the …
Thanks for the changes! Did another pass and I think this is good to go now - could you just solve the merge conflicts?
Thanks!
Exciting update! I'm curious to know if this enhancement also enables the use of aggregations within recursive CTEs, similar to what's described in the Materialize documentation here: https://materialize.com/docs/sql/select/recursive-ctes/
Add new recursive semantics (`USING KEY`) (duckdb/duckdb#12430)
This PR adds the optional modifier `USING KEY` to recursive CTEs. The semantics differ only slightly from those of regular recursive CTEs, yet the performance impact can be substantial. Below, we motivate the `USING KEY` modifier using two example queries that focus on the efficient evaluation of common algorithms over graphs. We have, however, found the modifier to be beneficial for a wide range of recursive (or iterative) computations over relational data. For the interested, you can find an in-depth discussion of the rationale, the semantics, and applications of `USING KEY` in the following paper:

In a nutshell, `USING KEY` alters the behavior of a regular recursive CTE as follows:

In each iteration, a regular CTE appends result rows to the union table, which ultimately defines the overall result of the CTE.
In contrast, a CTE with `USING KEY` has the ability to update rows that have been placed in the union table in an earlier iteration: if the current iteration produces a row with key k, it replaces a row with the same key k in the union table. If no such row exists in the union table yet, the new row is appended to the union table as usual.
This allows a CTE to exercise fine-grained control over the union table contents. Avoiding the append-only behavior can lead to significantly smaller union table sizes. This helps query runtime and memory consumption, and makes it feasible to access the union table *while the iteration is still ongoing* (this is impossible for regular recursive CTEs): in a CTE `WITH RECURSIVE T(...) USING KEY ...`, table `T` denotes the rows added by the last iteration (as is usual for recursive CTEs), while table `recurring_T` denotes the union table built so far. References to `recurring_T` allow for the elegant and idiomatic translation of rather complex algorithms into readable SQL code.
Here is an example of a CTE with `USING KEY` with a key column (`a`) and a payload column (`b`). In the first iteration we have two different keys, `1` and `2`. These two rows will generate two new rows, `(1, 3)` and `(2, 4)`. In the next iteration we produce a new key, `3`, which will generate a new row. We also generate the row `(2, 3)`, where `2` is a key that already exists from the previous iteration. This will overwrite the old payload `4` with the new payload `3`.

The result of the CTE with `USING KEY` would be as follows:

Normally, a CTE adds all rows from each iteration to the union table, like the table below.
Already in this small example we can see that the resulting table for `USING KEY` grows very slowly in comparison to the normal CTE.

Performance
The given example is simple, but not really useful for comparing the computation time of the `USING KEY` CTE and the normal recursive CTE. Therefore, we decided to formulate two algorithms for graphs that exploit the full potential of the new semantics, starting with Connected Components (CC) and then Distance Vector Routing (DVR), and measured the time of the `USING KEY` variant against a handcrafted version written with the standard `WITH RECURSIVE`. The queries used the LDBC SNB Person_knows_Person and Person tables as benchmarks.

CC
The connected components algorithm computes, for each node in a graph, the node with the smallest id to which it is connected. With a scaling factor of 0.003, the `WITH RECURSIVE` version takes 78.394 seconds on my machine.

With the newly implemented semantics, the query below takes only 9.5 seconds.
DVR
The second algorithm measured is DVR. It is used to compute the shortest paths between all nodes in a network. Again, we used LDBC SNB with a scaling factor of 0.003. The query using only `WITH RECURSIVE` takes 14 seconds.

With the newly implemented semantics, the query below takes only 0.7 seconds.