Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads by gropaul · Pull Request #16301 · duckdb/duckdb · GitHub

Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads #16301


Merged

Conversation

@gropaul (Contributor) commented Feb 18, 2025

I noticed a regression when building very large hash tables. With the current fixed task size, very large hash tables generate too many tasks, so the hash table is no longer zeroed sequentially, which causes the regression. I think the parallel zeroing is a great improvement, but it would be even better with coarser-grained parallelism that takes the number of threads and the size of the hash table into account.

My benchmark builds only the join hash table, on 100,000,000 unique keys, by joining against an empty probe side and disabling the optimizer to prevent join-side swapping:

load
ATTACH '/Users/paul/micro.duckdb' AS micro;
USE micro;
PRAGMA disable_optimizer;
PRAGMA disable_progress_bar;

run
SELECT * FROM probe JOIN build ON probe.key = build.key;

Where

CREATE TABLE probe (key BIGINT);
CREATE TABLE build AS
  SELECT
      range AS key
  FROM
      RANGE(0, 100_000_000)
  ORDER BY hash(key + 32);

Running on a Macbook Pro with an M4 and 8 threads I get the following performance numbers:

| Experiment | Strategy | Average Timing (s) |
|---|---|---|
| 1 | ENTRIES_PER_TASK = 131072 (fixed) | 0.404637 |
| 2 | entries_per_task = entry_count / num_threads / 16 | 0.336537 |
| 3 | entries_per_task = entry_count / num_threads / 8 | 0.306807 |
| 4 | entries_per_task = entry_count / num_threads / 4 | 0.286299 |
| 5 | entries_per_task = entry_count / num_threads / 2 | 0.289076 |
| 6 | entries_per_task = entry_count / num_threads / 1 | 0.285086 |

To still have more tasks than threads, I now set the number of tasks to 4 times the number of threads, but we could even consider having the same number of tasks as threads. Let me know what you think.
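The chosen strategy (experiment 4 above) can be sketched as a standalone function; the type alias and function name here are illustrative, not DuckDB's actual code:

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Sketch of the proposed strategy: split the hash table so that there are
// roughly 4x more zeroing tasks than threads, i.e. each thread gets ~4 tasks.
idx_t EntriesPerTask(idx_t entry_count, idx_t num_threads) {
	return entry_count / num_threads / 4;
}
```

With the benchmark's 100M-entry table and 8 threads this yields tasks of 3,125,000 entries each, far coarser than the previous fixed 131072.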

@Mytherin Mytherin requested a review from lnkuiper February 18, 2025 17:13
@lnkuiper (Contributor) left a comment

Thanks for the PR! Can we also check what this means for the performance of building smaller hash tables? I'm afraid this may over-parallelize when there are, e.g., 192 threads, but the hash table is only 20M (which would result in tiny tasks of only 26k). Maybe we can add a minimum entries per task as well?

const idx_t entries_per_task = MaxValue(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
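The suggested clamp can be sketched as follows; the concrete MINIMUM_ENTRIES_PER_TASK value used here (131072, the previous fixed task size mentioned above) is an assumption for illustration, and std::max stands in for DuckDB's MaxValue:

```cpp
#include <algorithm>
#include <cstdint>

using idx_t = uint64_t;

// Assumed minimum task size (131072 was the previous fixed ENTRIES_PER_TASK).
static constexpr idx_t MINIMUM_ENTRIES_PER_TASK = 131072;

// Clamped task size: aim for 4x more tasks than threads, but never create
// tasks smaller than the minimum, so small tables are not over-parallelized.
idx_t ClampedEntriesPerTask(idx_t entry_count, idx_t num_threads) {
	return std::max(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
}
```

For the scenario above (20M entries, 192 threads), the unclamped size would be only 26041 entries, so the clamp kicks in and keeps tasks at 131072 entries.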

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 19, 2025 15:00
@gropaul (Contributor, Author) commented Feb 19, 2025

Good point! I added a MINIMUM_ENTRIES_PER_TASK, which is equal to PARALLEL_CONSTRUCT_THRESHOLD / 8; this was also the previous ENTRIES_PER_TASK size.

@gropaul gropaul marked this pull request as ready for review February 19, 2025 15:01
@lnkuiper (Contributor) left a comment

Thanks for the changes!

@gropaul (Contributor, Author) commented Feb 20, 2025

Hi @lnkuiper, I have a quick question/proposal

We now have this in the HashJoinTableInitEvent

const auto entry_count = ht.capacity;
auto num_threads = NumericCast<idx_t>(sink.num_threads);
if (num_threads == 1 || (entry_count < PARALLEL_CONSTRUCT_THRESHOLD && !context.config.verify_parallelism)) {

And in the HashJoinFinalizeEvent we have

if (num_threads == 1 || ((ht.Count() < PARALLEL_CONSTRUCT_THRESHOLD || skew > SKEW_SINGLE_THREADED_THRESHOLD) && !context.config.verify_parallelism)) {

Both events have their own PARALLEL_CONSTRUCT_THRESHOLD check; the first uses the hash table's capacity and the second the actual element count, which is smaller than the capacity. What do you think about creating a single ShouldFinalizeSingleThreaded function that returns a bool and uses ht.Count() for both? To me the current state feels a bit inconsistent and hard to maintain.
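The proposed consolidation could look roughly like this; the function name comes from the comment above, while the threshold value here is an assumed placeholder (DuckDB defines its own PARALLEL_CONSTRUCT_THRESHOLD):

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Assumed placeholder value; DuckDB defines the real PARALLEL_CONSTRUCT_THRESHOLD.
static constexpr idx_t PARALLEL_CONSTRUCT_THRESHOLD = 1048576;

// One shared check, based on the actual element count, used by both events.
bool ShouldFinalizeSingleThreaded(idx_t num_threads, idx_t count, bool verify_parallelism) {
	return num_threads == 1 || (count < PARALLEL_CONSTRUCT_THRESHOLD && !verify_parallelism);
}
```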

@lnkuiper (Contributor) commented

@gropaul that sounds reasonable. Note that we can still memset in parallel even if the data is skewed, but we shouldn't do parallel inserts into it, so this should only be checked in HashJoinFinalizeEvent and not in HashJoinTableInitEvent.
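The distinction drawn here can be sketched as two separate checks (signatures and constant values are hypothetical): skew only forces single-threaded inserts in the finalize step, while zeroing in the init step can stay parallel regardless of skew.

```cpp
#include <cstdint>

using idx_t = uint64_t;

// Assumed placeholder values for illustration only.
static constexpr idx_t PARALLEL_CONSTRUCT_THRESHOLD = 1048576;
static constexpr double SKEW_SINGLE_THREADED_THRESHOLD = 0.5;

// Init (zeroing): skew is irrelevant; only size and thread count matter.
bool InitSingleThreaded(idx_t num_threads, idx_t capacity, bool verify) {
	return num_threads == 1 || (capacity < PARALLEL_CONSTRUCT_THRESHOLD && !verify);
}

// Finalize (inserts): skewed data additionally forces single-threaded mode.
bool FinalizeSingleThreaded(idx_t num_threads, idx_t count, double skew, bool verify) {
	return num_threads == 1 ||
	       ((count < PARALLEL_CONSTRUCT_THRESHOLD || skew > SKEW_SINGLE_THREADED_THRESHOLD) && !verify);
}
```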

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 21, 2025 13:55
@gropaul gropaul marked this pull request as ready for review February 21, 2025 13:56
@lnkuiper (Contributor) left a comment

Thanks for the changes!

@Mytherin Mytherin merged commit 244951e into duckdb:main Feb 24, 2025
51 checks passed
@Mytherin (Collaborator) commented

Thanks!

Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads (duckdb/duckdb#16301)
MAIN_BRANCH_VERSIONING: main branch to get descriptors like v1.3.0-dev1234 instead of v1.2.1-dev1234 (duckdb/duckdb#16366)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads (duckdb/duckdb#16301)
MAIN_BRANCH_VERSIONING: main branch to get descriptors like v1.3.0-dev1234 instead of v1.2.1-dev1234 (duckdb/duckdb#16366)