8000 [LZ4] Branch on token instead of length by Nicoshev · Pull Request #1611 · lz4/lz4 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[LZ4] Branch on token instead of length #1611

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from
Open

Conversation

Nicoshev
Copy link
Contributor

Changing the check:

length == RUN_MASK

by

token >= (RUN_MASK << ML_BITS)

This allows the branch to be evaluated without needing to compute

length = token >> ML_BITS;

Changing the check:

length == RUN_MASK

by

token >= (RUN_MASK << ML_BITS)

This allows the branch to be evaluated without needing to compute 

length = token >> ML_BITS;
@Cyan4973
Copy link
Member
Cyan4973 commented May 1, 2025

I understand the proposed logic, and why it seems in theory a small advantage.
That being said, compilers are pretty powerful these days, and can change the underlying assembler pretty wildly, so it's hard to tell if such an optimization remains valid after all these transformations. Best is to measure.

Here is what I get by comparing dev (left) with this PR (right) on a i7-9700k using a collection of compilers :

compile clean lz4 with gcc-9                                                   │compile clean lz4 with gcc-9
2e3730f45fcfbb7522f9331de4b49f42  lz4                                          │107dd30a285af0b95786ec4aa902f2c3  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4538.2 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3923.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4500.5 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3686.8 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4062.8 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3374.8 MB/s
compile clean lz4 with gcc-10                                                  │compile clean lz4 with gcc-10
86b1ef2cc78ca62b360c7cc6edbfbeef  lz4                                          │d58b85dfe215e11ba2f60ea4e2797554  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3952.0 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3907.7 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3692.1 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3678.0 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3387.7 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3373.0 MB/s
compile clean lz4 with gcc-11                                                  │compile clean lz4 with gcc-11
ea55986f1b1b0c8137ab53f0698fe9b1  lz4                                          │824b357a5cd631c61ee19dfbfa4d0678  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4540.3 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3924.5 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4505.8 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3696.7 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4061.3 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3377.3 MB/s
compile clean lz4 with gcc-12                                                  │compile clean lz4 with gcc-12
fa631590557ef677006d345021b539da  lz4                                          │ce8db3075e7eb7a51436f8918378c9b3  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3893.5 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3893.0 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3669.4 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3664.0 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3362.3 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3358.8 MB/s
compile clean lz4 with gcc-13                                                  │compile clean lz4 with gcc-13
a41b84f306264f95d7eba9358819d3d6  lz4                                          │d5c94997fc88b26aeeb8ac379137bf1e  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 3583.4 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4559.8 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3260.3 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4493.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3008.8 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4033.3 MB/s
compile clean lz4 with clang-14                                                │compile clean lz4 with clang-14
770c4631ba63e46492e342058c0aed62  lz4                                          │2cd4569d1ce2608ba7be71998fdc5dd2  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4462.4 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4509.5 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4390.3 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4487.0 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3992.0 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4052.1 MB/s
compile clean lz4 with clang-15                                                │compile clean lz4 with clang-15
133f6248b30a64ead2fa23e4907de0e4  lz4                                          │d27c7b5635a6b404713869e1f07fff60  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4305.9 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4318.2 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4218.7 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4220.2 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3809.7 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3817.1 MB/s
compile clean lz4 with clang-16                                                │compile clean lz4 with clang-16
117fdfc5785d31f81fc3cba9f7f29b4b  lz4                                          │769db30d69598a6d4a4f719f2d1bfcb5  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4215.6 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4071.1 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4154.4 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3917.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3752.0 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3564.4 MB/s
compile clean lz4 with clang-17                                                │compile clean lz4 with clang-17
695ea6cc05a181f7bd5ce61e39a4533e  lz4                                          │56150fa62e64c402a3645d06adfdbef8  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4500.4 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4502.4 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4498.2 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 4493.5 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4043.9 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 4036.4 MB/s
compile clean lz4 with clang-18                                                │compile clean lz4 with clang-18
630f86bb6af9fada8e37b7c8553da78b  lz4                                          │698e3550b711a6c801b12ca597462a12  lz4
Benchmark Decompression of LZ4 Frame _without_ checksum even when present      │Benchmark Decompression of LZ4 Frame _without_ checksum even when present
 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4127.0 MB/s │ 1#lesia.tar.L12.lz4 : 211957760 ->  77382713 (2.739),   0.0 MB/s, 4091.5 MB/s
 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3942.7 MB/s │ 1#lgary.tar.L12.lz4 :   3153920 ->   1185988 (2.659),   0.0 MB/s, 3929.3 MB/s
 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3588.1 MB/s │ 1#enwik9.L12.lz4    :1000000000 -> 372443347 (2.685),   0.0 MB/s, 3570.3 MB/s

It's a lot of traces, and it's a bit confusing because the impact goes in all directions.
That's because the noise introduced by random instruction alignment is larger than the difference offered by the patch.
So it can be difficult to settle on the real impact.

Here is a (rough) summary:

gcc-9: -600
gcc-10: -50
gcc-11: -600
gcc-12: 0
gcc-13: +1000
clang-14: +50
clang-15: 0
clang-16: -200
clang-17: 0
clang-18: -30

"Average" seems slightly negative, but given the level of noise, I wouldn't buy too much into that.
Let's just say "inconclusive".

A nice secondary observation is that clang seems more stable (than gcc), both across commits and across versions. But even there, I observe a ~500 MB/s difference between fastest and slowest versions, so that's still significant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0