8000 Optimize literals encoding during compression by Nicoshev · Pull Request #1411 · lz4/lz4 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Optimize literals encoding during compression #1411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

Nicoshev
Copy link
Contributor
@Nicoshev Nicoshev commented May 14, 2024

Hello, I hope you are doing well : )

I wanted to propose a change into the way the literals are written onto the output buffer during compression.

The intent is to take advantage of the existing execution fork yielded by 'if (litLength >= RUN_MASK)'
The wildCopy loop is now ommited when less than 15 literals need to be copied.
The precondition of having to copy at least 15 literals is used to process them chunks of 16 bytes.

Regarding performance, the same two scenarios of PR#1407 were tested.

Scenario 1:
CPU: Ryzen 5900X
Compilers: gcc-11, clang-8.
Files: enwik9, each file within the silesia corpus, a custom memory dump retrieved from my PC.
Results: Decompression throughput improved in all combinations of files and compilers.

Scenario 2:
CPU: i5-1245U
Compilers: gcc-[9-12], clang-[11-14].
Files: enwik9, 'xml' from the silesia corpus, the custom memory dump retrieved from my PC.
Results: Decompression throughput improved in all combinations of files and compilers.

On the Ryzen 5900:
The change yielded compression throughput improvements for all files when clang was used.
Regarding gcc, throughput decreased for 'sao' and 'x-ray' and increased for all other 12 files.
There seems to be an inclination towards the change improving throughput.

On the weaker i5-1245U:
on gcc-9: 10 files showed better throuhgput, 4 files worsened
on gcc-10: 12 files showed better throuhgput, 2 files worsened
on gcc-11: 13 files showed better throuhgput, 1 file worsened
on gcc-12: 13 files showed better throuhgput, 1 file worsened
on clang-11: 10 files showed better throuhgput, 4 files worsened
on clang-12: 12 files showed better throuhgput, 2 files worsened
on clang-13: 13 files showed better throuhgput, 1 files worsened
on clang-14: 12 files showed better throuhgput, 2 files worsened
Although it has more mixed results, it also showed a noticeable inclination towards including the change

Overall, there seems to be a trend towards improving performance by including the change.
It is relevant to remark that the trend improves as more powerful CPUs and more modern compilers are used.

Thanks,
Nicolas

@Cyan4973 Cyan4973 self-assigned this May 14, 2024
@Cyan4973
Copy link
Member

Differentiating literal copy operations depending on their lengths seems a good idea.
It's obvious that branch statistics will be different depending on knowing in advance if the length is short or long(er).

In term of execution, I still see a number of branches in this PR, and I would likely consider a slightly different variant, trying to reach some wildcopy16 or equivalent (note: it doesn't exist yet, we only have wildcopy8 and wildcopy32), though it strongly depends on guarantees of distance from end of buffer, which are not obvious at this point in the code base.

@Nicoshev
Copy link
Contributor Author

The existing wildCopyX functions can overwrite up to X bytes after the end of the passed buffer.
In this case, after the space for the literals, we can only guarantee 13 bytes.
That is why the extra 8 byte copy is appended after the loop: to handle the cases where (litLength%16)<8.
Since litLength is at least 15, the first 16 bytes written are guaranteed to be within the output buffer limits.

I do not mind adding a function with this behaviour:
Copies in chunks of 16 bytes, only overwriting up to 8 bytes after the first iteration.

But just wanted to make sure that you are fine with having a slightly different wildcopy function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0