8000 Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up by lnkuiper · Pull Request #16328 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Verify UTF-8 in DeltaLengthByteArrayDecoder and speed it up #16328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 21, 2025

Conversation

lnkuiper
Copy link
Contributor

Although Parquet should be valid UTF-8, we can never be sure what other writers do, so we validate this. This validation was already there for PLAIN/RLE_DICTIONARY encoding but was missing for DELTA_LENGTH_BYTE_ARRAY. This PR adds the verification there as well.

Verifying UTF-8 takes is somewhat costly, so I've also worked on speeding it up by checking 8 bytes at a time, instead of 1 byte. This is especially nice for DELTA_LENGTH_BYTE_ARRAY, as the strings are stored without their lengths in between, so we can verify many strings in one go.

Copy link
Collaborator
@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks great - one comment:

@arjenpdevries
Copy link
Contributor

Will verification be skipped if you know the data is written by duckdb for example?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 21, 2025 08:33
@lnkuiper lnkuiper marked this pull request as ready for review February 21, 2025 08:33
@lnkuiper
Copy link
Contributor Author

Will verification be skipped if you know the data is written by duckdb for example?

@arjenpdevries No, we always perform the UTF-8 check.

It would be a nice optimization to skip the checks if we wrote the file, but it also saves us from files that may have been tampered with.

Maybe we could add a read parameter to disable the check? Although that could lead to non-UTF-8 being ingested, and I'm not sure if that would be handled gracefully, it might lead to undefined behavior.

@Mytherin Mytherin merged commit ef88e2b into duckdb:main Feb 21, 2025
51 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
@lnkuiper lnkuiper deleted the parquet_stuff branch April 14, 2025 09:09
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0