Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads by gropaul · Pull Request #16301 · duckdb/duckdb · GitHub

Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads #16301


Merged

Conversation

@gropaul (Contributor) commented Feb 18, 2025

I noticed a regression when building very large hash tables. With the current fixed task size, very large hash tables generate too many tasks, so the hash table is no longer zeroed sequentially, which causes the regression. I think the parallel zeroing is a great improvement, but it would be even better with coarser-grained parallelism that takes the number of threads and the size of the hash table into account.

My benchmark builds only the join hash table, on 100,000,000 unique keys, by joining against an empty probe side and disabling the optimizer to prevent join-side swapping:

load
ATTACH '/Users/paul/micro.duckdb' AS micro;
USE micro;
PRAGMA disable_optimizer;
PRAGMA disable_progress_bar;

run
SELECT * FROM probe JOIN build ON probe.key = build.key;

Where

CREATE TABLE probe (key BIGINT);
CREATE TABLE build AS
  SELECT
      range AS key
  FROM
      RANGE(0, 100_000_000)
  ORDER BY hash(key + 32);

Running on a Macbook Pro with an M4 and 8 threads I get the following performance numbers:

| Experiment | Strategy | Average Timing (s) |
|---|---|---|
| 1 | ENTRIES_PER_TASK = 131072 (fixed) | 0.404637 |
| 2 | entries_per_task = entry_count / num_threads / 16 | 0.336537 |
| 3 | entries_per_task = entry_count / num_threads / 8 | 0.306807 |
| 4 | entries_per_task = entry_count / num_threads / 4 | 0.286299 |
| 5 | entries_per_task = entry_count / num_threads / 2 | 0.289076 |
| 6 | entries_per_task = entry_count / num_threads / 1 | 0.285086 |

To still have more tasks than threads, I now set the number of tasks to 4 times the number of threads, but we could even consider having the same number of tasks as threads. Let me know what you think.
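The chosen strategy (experiment 4 above) can be sketched as a standalone function; the type alias and function name here are illustrative, not DuckDB's actual code:

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Sketch of the proposed strategy: split the hash table so that there are
// roughly 4x more zeroing tasks than threads, i.e. each thread gets ~4 tasks.
idx_t EntriesPerTask(idx_t entry_count, idx_t num_threads) {
	return entry_count / num_threads / 4;
}
```

With the benchmark's 100M-entry table and 8 threads this yields tasks of 3,125,000 entries each, far coarser than the previous fixed 131072.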

@Mytherin Mytherin requested a review from lnkuiper February 18, 2025 17:13
@lnkuiper (Contributor) left a comment

Thanks for the PR! Can we also check what this means for the performance of building smaller hash tables? I'm afraid this may over-parallelize when there are, e.g., 192 threads, but the hash table is only 20M (which would result in tiny tasks of only 26k). Maybe we can add a minimum entries per task as well?

const idx_t entries_per_task = MaxValue(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
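The suggested clamp can be sketched as follows; the concrete MINIMUM_ENTRIES_PER_TASK value used here (131072, the previous fixed task size mentioned above) is an assumption for illustration, and std::max stands in for DuckDB's MaxValue:

```cpp
#include <algorithm>
#include <cstdint>

using idx_t = uint64_t;

// Assumed minimum task size (131072 was the previous fixed ENTRIES_PER_TASK).
static constexpr idx_t MINIMUM_ENTRIES_PER_TASK = 131072;

// Clamped task size: aim for 4x more tasks than threads, but never create
// tasks smaller than the minimum, so small tables are not over-parallelized.
idx_t ClampedEntriesPerTask(idx_t entry_count, idx_t num_threads) {
	return std::max(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
}
```

For the scenario above (20M entries, 192 threads), the unclamped size would be only 26041 entries, so the clamp kicks in and keeps tasks at 131072 entries.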

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 19, 2025 15:00
@gropaul (Contributor, Author) commented Feb 19, 2025

Good point! I added a MINIMUM_ENTRIES_PER_TASK, which is equal to PARALLEL_CONSTRUCT_THRESHOLD / 8; this was also the previous ENTRIES_PER_TASK size.

@gropaul gropaul marked this pull request as ready for review February 19, 2025 15:01
@lnkuiper (Contributor) left a comment

Thanks for the changes!

@gropaul (Contributor, Author) commented Feb 20, 2025

Hi @lnkuiper, I have a quick question/proposal

We now have this in the HashJoinTableInitEvent

const auto entry_count = ht.capacity;
auto num_threads = NumericCast<idx_t>(sink.num_threads);
if (num_threads == 1 || (entry_count < PARALLEL_CONSTRUCT_THRESHOLD && !context.config.verify_parallelism)) {

And in the HashJoinFinalizeEvent we have

if (num_threads == 1 || ((ht.Count() < PARALLEL_CONSTRUCT_THRESHOLD || skew > SKEW_SINGLE_THREADED_THRESHOLD) && !context.config.verify_parallelism)) {

Both events have their own PARALLEL_CONSTRUCT_THRESHOLD check; the first uses the hash table's capacity and the second the actual element count, which is smaller than the capacity. What do you think about creating a single ShouldFinalizeSingleThreaded function that returns a bool and uses ht.Count() for both? To me the current state feels a bit inconsistent and hard to maintain.
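The proposed consolidation could look roughly like this; the function name comes from the comment above, while the threshold value here is an assumed placeholder (DuckDB defines its own PARALLEL_CONSTRUCT_THRESHOLD):

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Assumed placeholder value; DuckDB defines the real PARALLEL_CONSTRUCT_THRESHOLD.
static constexpr idx_t PARALLEL_CONSTRUCT_THRESHOLD = 1048576;

// One shared check, based on the actual element count, used by both events.
bool ShouldFinalizeSingleThreaded(idx_t num_threads, idx_t count, bool verify_parallelism) {
	return num_threads == 1 || (count < PARALLEL_CONSTRUCT_THRESHOLD && !verify_parallelism);
}
```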

@lnkuiper (Contributor) commented

@gropaul that sounds reasonable. Note that we can still memset in parallel even if the data is skewed, but we shouldn't do parallel inserts into it, so this should only be checked in HashJoinFinalizeEvent and not in HashJoinTableInitEvent.
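The distinction drawn here can be sketched as two separate checks (signatures and constant values are hypothetical): skew only forces single-threaded inserts in the finalize step, while zeroing in the init step can stay parallel regardless of skew.

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Assumed placeholder values for illustration only.
static constexpr idx_t PARALLEL_CONSTRUCT_THRESHOLD = 1048576;
static constexpr double SKEW_SINGLE_THREADED_THRESHOLD = 0.5;

// Init (zeroing): skew is irrelevant; only size and thread count matter.
bool InitSingleThreaded(idx_t num_threads, idx_t capacity, bool verify) {
	return num_threads == 1 || (capacity < PARALLEL_CONSTRUCT_THRESHOLD && !verify);
}

// Finalize (inserts): skewed data additionally forces single-threaded mode.
bool FinalizeSingleThreaded(idx_t num_threads, idx_t count, double skew, bool verify) {
	return num_threads == 1 ||
	       ((count < PARALLEL_CONSTRUCT_THRESHOLD || skew > SKEW_SINGLE_THREADED_THRESHOLD) && !verify);
}
```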

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 21, 2025 13:55
@gropaul gropaul marked this pull request as ready for review February 21, 2025 13:56
@lnkuiper (Contributor) left a comment

Thanks for the changes!

@Mytherin Mytherin merged commit 244951e into duckdb:main Feb 24, 2025
51 checks passed
@Mytherin (Collaborator) commented

Thanks!

Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads (duckdb/duckdb#16301)
MAIN_BRANCH_VERSIONING: main branch to get descriptors like v1.3.0-dev1234 instead of v1.2.1-dev1234 (duckdb/duckdb#16366)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads (duckdb/duckdb#16301)
MAIN_BRANCH_VERSIONING: main branch to get descriptors like v1.3.0-dev1234 instead of v1.2.1-dev1234 (duckdb/duckdb#16366)