Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up #16328
… 8-byte and 1-byte UTF-8 checking
Thanks for the PR! Looks great - one comment:
Will verification be skipped if you know the data was written by DuckDB, for example?
@arjenpdevries No, we always perform the UTF-8 check. Skipping the check for files we wrote ourselves would be a nice optimization, but the check also protects us from files that may have been tampered with. Maybe we could add a read parameter to disable the check? That could let non-UTF-8 data be ingested, though, and I'm not sure it would be handled gracefully; it might lead to undefined behavior.
Thanks!
Verify UTF-8 in `DeltaLengthByteArrayDecoder` and speed it up (duckdb/duckdb#16328)
Although strings in Parquet should be valid UTF-8, we can never be sure what other writers do, so we validate this. This validation was already there for `PLAIN`/`RLE_DICTIONARY` encoding but was missing for `DELTA_LENGTH_BYTE_ARRAY`. This PR adds the verification there as well.

Verifying UTF-8 is somewhat costly, so I've also worked on speeding it up by checking 8 bytes at a time instead of 1 byte. This is especially nice for `DELTA_LENGTH_BYTE_ARRAY`, as the strings are stored without their lengths in between, so we can verify many strings in one go.
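To illustrate the idea of checking 8 bytes at a time over back-to-back strings, here is a minimal sketch. It is not the code from this PR: the names `IsAscii`, `ValidateUtf8Scalar`, and `ValidateStrings` are hypothetical, and the strategy shown (an 8-byte ASCII fast path over the whole concatenated buffer, with a per-string scalar fallback) is only one plausible way to do it. The fallback validates each string separately because a multi-byte character must not straddle a string boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if every byte in [data, data + len) is ASCII.
// Loads 8 bytes at a time and tests all eight high bits with one mask.
bool IsAscii(const uint8_t *data, size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        std::memcpy(&word, data + i, 8); // safe unaligned load
        if (word & 0x8080808080808080ULL) {
            return false;
        }
    }
    for (; i < len; i++) { // tail: fewer than 8 bytes left
        if (data[i] & 0x80) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: strict byte-by-byte UTF-8 validation of one string.
bool ValidateUtf8Scalar(const uint8_t *data, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = data[i];
        if (b < 0x80) {
            i++;
            continue;
        }
        size_t extra;
        uint32_t cp;
        if ((b & 0xE0) == 0xC0) {
            extra = 1;
            cp = b & 0x1F;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2;
            cp = b & 0x0F;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3;
            cp = b & 0x07;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (len - i <= extra) {
            return false; // sequence is cut off by the end of the string
        }
        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = data[i + k];
            if ((c & 0xC0) != 0x80) {
                return false;
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range code points.
        if (extra == 1 && cp < 0x80) {
            return false;
        }
        if (extra == 2 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) {
            return false;
        }
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) {
            return false;
        }
        i += extra + 1;
    }
    return true;
}

// Validates `count` strings stored back to back in `data`, with their
// lengths kept separately (the layout assumed here for illustration).
bool ValidateStrings(const uint8_t *data, const uint32_t *lengths, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += lengths[i];
    }
    // Fast path: if the whole buffer is ASCII, every string in it is valid.
    if (IsAscii(data, total)) {
        return true;
    }
    // Slow path: check each string on its own.
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        if (!ValidateUtf8Scalar(data + offset, lengths[i])) {
            return false;
        }
        offset += lengths[i];
    }
    return true;
}
```

With this sketch, an all-ASCII buffer holding many strings is accepted after a single pass over 8-byte words, while a buffer such as `h`, `0xC3`, `0xA9`, `llo` is accepted as one 6-byte string but rejected when split into lengths 2 and 4, since the first string would then end in the middle of a character.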