Reduce locking with `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` in Parquet writer by lnkuiper · Pull Request #16928 · duckdb/duckdb · GitHub

Reduce locking with FILE_SIZE_BYTES/ROW_GROUPS_PER_FILE in Parquet writer #16928


Merged: 2 commits, Apr 1, 2025

lnkuiper (Contributor) commented Apr 1, 2025

When all threads work on a COPY TO that writes to a single output file, there will always be some contention, which we try to keep to a minimum. However, for the FILE_SIZE_BYTES/ROW_GROUPS_PER_FILE parameters, we used a single lock to manage rotating to the next file, which led to very low CPU utilization.

When each thread writes to its own output file (using PER_THREAD_OUTPUT), there isn't any contention:

call dbgen(sf=10);
set preserve_insertion_order=false;
copy lineitem to 'lineitem' (format parquet, row_group_size_bytes '128mb', file_size_bytes '128mb', per_thread_output true, overwrite true);

The COPY TO query finishes in ~2.2s on my laptop. When we remove PER_THREAD_OUTPUT:

copy lineitem to 'lineitem' (format parquet, row_group_size_bytes '128mb', file_size_bytes '128mb', overwrite true);

The query finishes in ~16s, more than 7x slower, even though there are 10 threads available.

In this PR, I've reworked the locking mechanism, which has brought the time of the second query down to ~2.3s, only slightly slower than with PER_THREAD_OUTPUT.
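
To illustrate the general idea of shrinking the critical section around file rotation, here is a minimal Python sketch. It is not DuckDB's implementation, and the PR text does not describe the new mechanism in detail. The names `OutputFile`, `RotatingWriter`, and `run_demo` are hypothetical, and the approach (a short rotation lock for bookkeeping plus a per-file lock for the actual write) is just one assumed way to keep threads from serializing on a single lock.

```python
import threading


class OutputFile:
    """One output file: a list of row groups plus its own lock."""

    def __init__(self, index):
        self.index = index
        self.lock = threading.Lock()
        self.row_groups = []


class RotatingWriter:
    """Toy writer that rotates to a new file every `row_groups_per_file` row groups."""

    def __init__(self, row_groups_per_file):
        self.row_groups_per_file = row_groups_per_file
        self.rotation_lock = threading.Lock()  # guards the three fields below
        self.files = [OutputFile(0)]
        self.current = self.files[0]
        self.reserved = 0  # slots handed out for the current file

    def _reserve_slot(self):
        # Short critical section: only bookkeeping, no data is written here.
        with self.rotation_lock:
            if self.reserved == self.row_groups_per_file:
                self.current = OutputFile(len(self.files))
                self.files.append(self.current)
                self.reserved = 0
            self.reserved += 1
            return self.current

    def write(self, rows):
        row_group = sorted(rows)  # stand-in for encoding/compression, done unlocked
        target = self._reserve_slot()
        with target.lock:  # contention is limited to threads sharing this file
            target.row_groups.append(row_group)


def run_demo(num_threads, groups_per_thread, row_groups_per_file):
    """Write from several threads and return the row-group count of each file."""
    writer = RotatingWriter(row_groups_per_file)

    def worker(thread_id):
        for i in range(groups_per_thread):
            writer.write([thread_id, i])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [len(f.row_groups) for f in writer.files]
```

In this toy model, `run_demo(8, 25, 10)` hands out 200 slots and so ends up with 20 files of 10 row groups each. The expensive step runs outside any lock, and the rotation lock is held only long enough to update two counters.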

Mytherin merged commit 5333f60 into duckdb:main on Apr 1, 2025 (46 of 47 checks passed)

Mytherin (Collaborator) commented Apr 1, 2025

Thanks! LGTM

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Reduce locking with `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` in Parquet writer (duckdb/duckdb#16928)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
Reduce locking with `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` in Parquet writer (duckdb/duckdb#16928)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
Reduce locking with `FILE_SIZE_BYTES`/`ROW_GROUPS_PER_FILE` in Parquet writer (duckdb/duckdb#16928)