Improvements to Out-of-Core Hash Join by lnkuiper · Pull Request #4970 · duckdb/duckdb · GitHub

Improvements to Out-of-Core Hash Join #4970

Merged: 71 commits, Nov 8, 2022

Conversation

@lnkuiper (Contributor) commented Oct 12, 2022

Continued where I left off. We now partition the probe side as well (instead of scanning the whole probe side multiple times), which means we can support out-of-core right/outer/mark/anti joins.
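
As a hedged sketch of what this enables (illustrative table names and sizes, not the PR's test setup), a query like the following can now run as an out-of-core FULL OUTER join under a low memory limit:

-- Hedged sketch: illustrative tables, not the PR's benchmark.
SET memory_limit = '1GB';  -- small enough that the build side must spill to disk

CREATE TABLE build_side AS SELECT range AS k FROM range(100000000);
CREATE TABLE probe_side AS SELECT range AS k FROM range(100000000);

-- With probe-side partitioning, this FULL OUTER join can run out-of-core.
SELECT count(*) FROM probe_side FULL OUTER JOIN build_side USING (k);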

Probe-side partitioning happens during the streaming probe (PhysicalHashJoin::Execute): tuples that cannot be probed against the partitions of the hash table during this phase, because they belong to different partitions, are sunk into a PartitionedColumnData. However, if we can finish probing in just two rounds, the probe side does not need to be partitioned at all, and we sink into a ColumnDataCollection instead.

PartitionedColumnData is a new generic partitioning interface. I've implemented RadixPartitionedColumnData. An issue with multi-threaded partitioning in general is that each thread allocates data for all partitions, which results in high memory usage. We prevent this by sharing a single allocator per partition across all threads. The PartitionedColumnData class can be extended to do, e.g., Hive partitioning in a streaming manner.
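
To illustrate just the radix idea (this is not the internal PartitionedColumnData code path), a tuple's partition is derived from its hash; in SQL terms, roughly:

-- Conceptual illustration only: assigning rows to one of 16 radix partitions by hash.
SELECT i, hash(i) % 16 AS partition_id
FROM range(10) AS t(i);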

I've also implemented the ColumnDataConsumer class, which can read and consume a ColumnDataCollection. This is useful for the out-of-core hash join, as we want to read the probe-side ColumnDataCollection just once. Previously, data that had just been read was written back to disk before being thrown away, which was wasteful.

I've run the same benchmark as in my previous PR on my laptop: joining two tables of 100M integers each that have only 1k matching values. Here are the numbers:

| Memory limit (GB) | Old time (s) | New time (s) |
|---|---|---|
| 10 | 1.97 | 1.96 |
| 9 | 1.97 | 1.97 |
| 8 | 2.23 | 2.22 |
| 7 | 2.23 | 2.44 |
| 6 | 2.27 | 2.39 |
| 5 | 2.27 | 2.32 |
| 4 | 2.81 | 2.45 |
| 3 | 5.60 | 3.20 |
| 2 | 7.69 | 3.28 |
| 1 | 17.73 | 4.35 |
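
The exact benchmark script is not part of this PR; a setup of roughly this shape (illustrative names, two 100M-integer tables whose key ranges overlap in only 1000 values) reproduces the workload:

-- Sketch of the benchmark shape, not the original script.
CREATE TABLE lhs AS SELECT range AS i FROM range(100000000);
CREATE TABLE rhs AS SELECT range + 99999000 AS i FROM range(100000000);

SET memory_limit = '4GB';  -- vary this to reproduce the rows of the table above

SELECT count(*) FROM lhs JOIN rhs USING (i);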

As we can see, performance is mostly the same as with my previous PR until the hash table is many times larger than the available memory. That is where this PR improves performance a lot. In the previous PR, when we had to do many partitioned probe rounds, we created a lot of I/O pressure by reading and writing the entire probe side every time. In this PR, the I/O pressure is much lower, as we read and write the probe-side data only once.

I will continue tweaking the performance in future PRs. Happy to receive feedback!

Edit: this PR uncovered a bug in how execution pipelines were scheduled. The problem was rather complicated, so I will try to explain it here.

We have parallelism in regular pipelines, e.g., SCAN -> FILTER -> PROJECTION -> AGGREGATION. The data is pushed from the source (SCAN) through the streaming FILTER and PROJECTION operators, into the sink (AGGREGATION).

Pipelines can also have dependencies on other pipelines, for example SCAN -> JOIN (the build side of the join) and SCAN -> JOIN -> AGGREGATION (the probe side of the join). The second pipeline depends on the first one being done.
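
For example, a query of this shape (t1 and t2 are placeholder tables) produces exactly those two pipelines; EXPLAIN shows the hash join with its two inputs:

-- Placeholder tables t1 and t2: the plan has a build pipeline (scan t2 into the
-- join's hash table) and a probe pipeline (scan t1, probe, aggregate) that
-- depends on the build pipeline being finished.
EXPLAIN SELECT t1.key, count(*)
FROM t1
JOIN t2 USING (key)
GROUP BY t1.key;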

Besides these, we also have two "special" kinds of pipelines: "union" pipelines, for UNION queries, and "child" pipelines, for streaming operators that become a source operator after the streaming phase is done (e.g., scanning the hash table of a join for unmatched tuples in a FULL OUTER join). These cases are special because they share their sink with other pipelines.
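
As a hedged example of such a child pipeline (placeholder tables again), a FULL OUTER join first streams the probe side through the join, and afterwards the join becomes a source that scans its hash table for unmatched build-side tuples:

-- The streaming probe handles matches; the unmatched build-side tuples are then
-- emitted by a "child" pipeline that shares its sink with the probe pipeline.
SELECT count(*)
FROM t1
FULL OUTER JOIN t2 USING (key);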

We have run into problems with these special cases before, because setting up their dependencies correctly works differently than for regular pipelines. With the way the code was written before, it was very difficult to get this right. I've refactored pipeline construction around a new class called MetaPipeline, which holds multiple Pipelines that share the same sink. This makes it much easier to set up the dependencies correctly.

As a result (bonus!) of this refactor, we can now execute union pipelines in parallel, e.g.:

SELECT * FROM t1
UNION ALL
SELECT * FROM t2
UNION ALL
SELECT * FROM t3

Before, we would scan t1, then t2, then t3. If t1, t2, and t3 are small enough that none of them keeps all threads busy on its own, we would not utilize all threads for the entire query. Now, we can scan all three tables concurrently.

Quick test with sorting a union:

create table test as select cast(random() * 100000000 as int) i from range(100000);
with union_cte as (
select i from test
union all
select i from test
union all
select i from test
union all
select i from test)
select count(i) from (select i from union_cte order by i offset 1);

This PR: 0.014s
Master: 0.023s

@lnkuiper (Contributor, Author) commented Nov 2, 2022

I think I got the final thread sanitizer issues out of the way, but the R + arrow test seems to be failing. Is this the test that occasionally fails, or is this something I caused? 10 tests fail with a std::exception.

@Mytherin (Collaborator) commented Nov 2, 2022

Hm, these do not look like "the usual" spurious R test failures. At least I haven't seen them before. I will rerun the test.

@lnkuiper (Contributor, Author) commented Nov 3, 2022

Thanks for re-running the tests. This time there are only 4 failures, but it seems to be the same problem. Will investigate.

@Mytherin (Collaborator) left a comment

Thanks for all the fixes! Looks great. The meta pipeline refactor looks very good. Some remaining comments from my side.

@@ -33,6 +33,10 @@ class PhysicalOrder : public PhysicalOperator {
void GetData(ExecutionContext &context, DataChunk &chunk, GlobalSourceState &gstate,
LocalSourceState &lstate) const override;

bool IsOrderPreserving() const override {
	return false;
}
Mytherin (Collaborator):

Is this correct? Won't this lead to incorrect parallel result set materialization?

lnkuiper (Contributor, Author):

This is correct. It sounds confusing, but the PhysicalOrder does not preserve the insertion order - the data is completely reordered.

@lnkuiper (Contributor, Author) commented Nov 4, 2022

Thanks for the feedback! I've implemented the suggested changes. Together with Hannes, I figured out what the R + arrow issue was: R does not like it when threads other than the main thread call back into R, which is what happened for arrow scans after I refactored the pipeline construction.

I've now tried to make sure that table function global source state initialization always happens in the main thread. Fingers crossed for CI 🤞

@lnkuiper (Contributor, Author) commented Nov 7, 2022

Looks like my changes fixed the previously failing R + arrow tests. I think the remaining failures are unrelated. Let's wait for the full CI run though, which is taking forever because everyone is so productive!
