-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Fixed AVX2 and SSSE3 speedup #3261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I locally measure about 5% speedup, and the correct bench on the specially constructed net that triggered saturation before https://github.com/official-stockfish/Stockfish/files/5510407/nn-new.nnue.gz so that looks nice. maybe you can describe the idea behind the canSaturate16 calculation ? I'd appreciate some additional review by those knowledgeable of this part of the code. |
@vondele Thanks for testing! |
0870bd1
to
f372aa8
Compare
AVX512 is now also implemented. |
In #3222 you mention a max input value of 127, but here allude to 128. Is this to be on the safe side w.r.t. the "can saturate" bounds? If 127 actually is the max value, I think you can safely check if sum of positive weights is smaller than or equal to 256, and mutatis mutandis for negative weights. |
f372aa8
to
0d91d41
Compare
@ddobbelaere Thanks for reviewing. You are correct. I should have reread my old PR. Since 127 * 258 < 2^15 - 1 the sum of the weights can be as high as 258. I have updated the PR. |
Cool, you are right about 258. Maybe this is a bit out of scope of current PR, but IMHO ideally we want to static assert check the bounds, s.t. if the 127 clamp value of the relu layer in https://github.com/official-stockfish/Stockfish/blob/0d91d41c3a19c1fa192afbd9c98e141bade0626b/src/nnue/layers/clipped_relu.h#L155 changes, we don't get into trouble without being aware. If we want to do this, I'm not so sure what the best approach would be though. Maybe extend the layer classes with static constexpr "max/min output" functions? |
I haven't yet tried to understand the details, but could it help to permute the weights to avoid cases of potential saturation? |
If canSaturate16[] is only accessed four bools at a time using a cast to (uint32_t *), it may be nicer to merge those four bools into a single bool. (This would also avoid theoretical aliasing problems.) |
with the current net, it seems can saturate is never true. |
No functional change bench: 4278746
0d91d41
to
6eeddb2
Compare
@ddobbelaere @syzygy1 I chose to calculate canSaturate16 as one bool per weight row since that's most straight forward. Processing 4 rows at a time seems like a detail of the current Propagate() implementation so I didn't want to add a dependence on that. However see the updated version. @vondele |
} | ||
|
||
*outptr = m128_haddx4(sum0, sum1, sum2, sum3, bias); | ||
} | ||
} | ||
else if constexpr (kOutputDimensions == 1) | ||
{ | ||
__m128i sum0 = _mm_setzero_si128(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it necessary to touch this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaximMolchanov Technically no but I didn't want to leave it inconsistent. Saving one _mm_setzero_si128() should not be measurable. I really didn't want to remove any of the first initialization that were outside of the loops but I couldn't make it work. See https://tests.stockfishchess.org/tests/view/5fd1ef601ac169120188836a which I tested first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing this test. Maybe there is no need to initialize with first value then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean even if it passes STC. Too much code for too small advantage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it passed STC I would say it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried non-zero initialization locally under SSSE3 and didn't succeed (checked using pyshbench). As for me it is some kind of magic: I get slow down even after obvious logical changes and after removing some code.
I noticed that the code (at least in Cfish) is very sensitive to register spilling. This optimisation slows down Cfish-gcc a bit (by introducing spilling for no good reason) and is a big speed up for Cfish-clang (presumably in part by removing spilling for no good reason). |
@syzygy1 I have little experience with register spilling. How do you detect it? Do you suggest any code improvements as a result? |
@mstembera I have now fixed the problem by removing the special treatment of the first iteration. I don't know why, but now gcc keeps everything in registers again, and your optimisation is a clear speed up. I have no suggestions on how to imrpove the SF code to avoid register spilling, but it is good to be aware that register spilling can be an issue when testing a change that clearly should speed up the NNUE code but doesn't. Anyway, well done 👍 |
@syzygy1 |
Indeed, I encountered the same problem when implementing the incremental updating improvement. At first I could not get a speed up at all, but luckily I figured out the reason before giving up on the idea. Regarding Cfish vs Stockfish, I don't think C instead of C++ makes a big difference. Some of the optimisations in Cfish are undoubtedly too dirty for Stockfish, e.g. updating the rule50 and pliesFromNull counters by combining them into a single uint16_t and adding 0x0101. Cfish has two NNUE implementations: nnue-regular.c (similar to SF's) and nnue-sparse.c. (Compile option sparse=yes/no.) Implementing nnue-sparse.c in SF is not straightforward. |
What do you think about next ideas?
|
@syzygy1 @MaximMolchanov As far as not loading nets that saturate I'm not sure how likely such nets will be generated in the future. Anyway that would not be my decision. |
Improves throughput by summing 2 intermediate dot products using 16 bit addition before upconverting to 32 bit. Potential saturation is detected and the code-path is avoided in this case. The saturation can't happen with the current nets, but nets can be constructed that trigger this check. STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479 LLR: 2.94 (-2.94,2.94) {-0.25,1.25} Total: 25544 W: 2451 L: 2296 D: 20797 Ptnml(0-2): 92, 1761, 8925, 1888, 106 about 5% speedup closes official-stockfish/Stockfish#3261 No functional change
To anyone curious, I just tested a version that uses16 bits for entire rows. The AVX2 compile takes the fast path in all cases. However it surprisingly failed badly https://tests.stockfishchess.org/tests/view/5fe328533932f79192d3963a and I'm not sure why. |
I also tested 4 intermediate results: https://tests.stockfishchess.org/tests/view/5fe013d93932f79192d394d0 |
@mstembera Here are my bench results with your last test (base is master, test is yours):
|
@MaximMolchanov Thanks for testing. I measured your patch as well and got the below although my local benches seem to have a lot of variance lately.
|
STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106
This is a fixed version of #3222 which guarantees no saturation.
(Combining this with moving the first iteration outside of the loop is worse.)
No functional change
bench: 4278746