Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads #16301
Conversation
Thanks for the PR! Can we also check what this means for the performance of building smaller hash tables? I'm afraid this may over-parallelize when there are, e.g., 192 threads, but the hash table has only 20M entries (which would result in tiny tasks of only ~26k entries each). Maybe we can add a minimum entries per task as well?
const idx_t entries_per_task = MaxValue(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
Good point! I added a `MINIMUM_ENTRIES_PER_TASK`.
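For context, a minimal sketch of how the resulting sizing could look (the helper names and the placeholder value of `MINIMUM_ENTRIES_PER_TASK` are assumptions for illustration, not necessarily the merged code):

```cpp
#include <algorithm>
#include <cstdint>

using idx_t = uint64_t;

// Placeholder value for illustration; the actual constant in the PR may differ.
static constexpr idx_t MINIMUM_ENTRIES_PER_TASK = idx_t(1) << 17;

// Aim for ~4 tasks per thread, but never create tasks smaller than the minimum.
idx_t ComputeEntriesPerTask(idx_t entry_count, idx_t num_threads) {
	return std::max(entry_count / num_threads / 4, MINIMUM_ENTRIES_PER_TASK);
}

// Ceiling division: how many zeroing tasks the chosen granularity yields.
idx_t ComputeTaskCount(idx_t entry_count, idx_t num_threads) {
	const idx_t entries_per_task = ComputeEntriesPerTask(entry_count, num_threads);
	return (entry_count + entries_per_task - 1) / entries_per_task;
}
```

With this lower bound, a 20M-entry table on 192 threads no longer degenerates into tiny ~26k-entry tasks; the minimum keeps each task large enough to be worth scheduling.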
Thanks for the changes!
Hi @lnkuiper, I have a quick question/proposal. We now have this in the […] and in the […]. Both events have their own […]
@gropaul that sounds reasonable. Note that we can still […]
Thanks for the changes!
Thanks!
Parallel HT Zeroing: Set entries_per_task so that there are 4x more tasks than threads (duckdb/duckdb#16301)
MAIN_BRANCH_VERSIONING: main branch to get descriptors like v1.3.0-dev1234 instead of v1.2.1-dev1234 (duckdb/duckdb#16366)
I noticed a regression when building very large hash tables. With the current fixed task size, very big hash tables generate too many tasks, so the hash table is no longer zeroed with sequential memory access, which causes the performance regression. I think the parallel zeroing is a great improvement, but it would be even better with less granular parallelism that takes the number of threads and the size of the hash table into account.
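To illustrate the over-parallelization (with assumed numbers; the fixed granularity below is hypothetical, not DuckDB's actual constant), a fixed task size makes the task count scale with the table size rather than with the thread count:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
	// Hypothetical fixed granularity, for illustration only.
	const uint64_t fixed_entries_per_task = 131072;
	const uint64_t num_threads = 8;
	const uint64_t entry_counts[] = {20000000ULL, 100000000ULL, 1000000000ULL};

	for (uint64_t entry_count : entry_counts) {
		// Ceiling division: task count under a fixed task size.
		const uint64_t num_tasks =
		    (entry_count + fixed_entries_per_task - 1) / fixed_entries_per_task;
		std::printf("entries=%10llu -> %6llu tasks for %llu threads\n",
		            (unsigned long long)entry_count,
		            (unsigned long long)num_tasks,
		            (unsigned long long)num_threads);
	}
	return 0;
}
```

Under these assumptions, a 100M-entry table yields ~763 tasks for just 8 threads, so each thread hops between many small, non-adjacent regions instead of sweeping a few large contiguous ones.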
My benchmark consists of only building the join hash table on 100,000,000 unique keys, by joining on an empty probe side and disabling the optimizer to prevent join-side swapping. Running on a MacBook Pro with an M4 and 8 threads, I get the following performance numbers:
To still have more tasks than threads, I now set the number of tasks to be 4 times the number of cores, but we could even consider having the same number of tasks as cores. Let me know what you think.
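For clarity, a minimal sketch of the shape I have in mind (assumed types and names, not the actual PR code): each task zeroes one contiguous slice of the entry array, so memory access within a task stays strictly sequential, and fewer, larger tasks keep hardware prefetchers effective.

```cpp
#include <cstdint>
#include <cstring>

using idx_t = uint64_t;

// Hypothetical task descriptor: one contiguous slice of the hash table.
struct ZeroTask {
	idx_t begin; // first entry this task zeroes
	idx_t end;   // one past the last entry
};

// Each task clears its chunk with a single sequential memset.
// (Entries are assumed to be 8-byte slots here, for illustration.)
void ExecuteZeroTask(uint64_t *entries, const ZeroTask &task) {
	std::memset(entries + task.begin, 0,
	            (task.end - task.begin) * sizeof(uint64_t));
}
```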