Reduce locking with `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` in Parquet writer #16928
When all threads work on a `COPY TO` to a single output file, there will always be some contention. We try to keep this to a minimum. However, for the `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` parameters, we used a single lock to manage rotating to the next file, which leads to very low CPU utilization.

When each thread writes to its own output file (using `PER_THREAD_OUTPUT`), there isn't any contention: the `COPY TO` query finishes in ~2.2s on my laptop. When we remove `PER_THREAD_OUTPUT`, the query finishes in ~16s, more than 7x slower, even though there are 10 threads available.

In this PR, I've reworked the locking mechanism, which brings the time of the second query down to ~2.3s, only slightly slower than with `PER_THREAD_OUTPUT`.