8000 Fixed AVX2 and SSSE3 speedup by mstembera · Pull Request #3261 · official-stockfish/Stockfish · GitHub

Fixed AVX2 and SSSE3 speedup #3261


Closed
wants to merge 1 commit into from

Conversation

mstembera
Contributor

STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106

This is a fixed version of #3222 which guarantees no saturation.
(Combining this with moving the first iteration outside of the loop is worse.)

No functional change
bench: 4278746

@vondele
Member
vondele commented Dec 12, 2020

I locally measure about a 5% speedup, and the bench is correct on the specially constructed net that previously triggered saturation (https://github.com/official-stockfish/Stockfish/files/5510407/nn-new.nnue.gz), so that looks nice.

Maybe you can describe the idea behind the canSaturate16 calculation?

I'd appreciate some additional review by those knowledgeable of this part of the code.

@mstembera
Contributor Author
mstembera commented Dec 12, 2020

@vondele Thanks for testing!
There are always 4 weight * input products that get combined in 16 bits before we upconvert to 32 bits. They are always the 1st and 2nd weights of the first vector together with the 1st and 2nd weights of the second vector, then the 3rd and 4th weights of the first vector with the 3rd and 4th weights of the second vector, and so on. The AVX2 vectors are 32 weights wide and the SSSE3 vectors 16 weights wide. Since the inputs are not known at the time of the calculation, we have to assume the worst case of every input being 128, with positive and negative weights never canceling out, because any canceling input could be 0. The formulas are then (w0 + w1 + w2 + w3) * 128 < 32768 for positive weights and (w0 + w1 + w2 + w3) * 128 > -32768 for negative weights. So the sum of the positive weights has to be less than 256 and the sum of the negative weights greater than -256.
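The per-group check described above can be sketched as follows — a standalone illustration rather than the PR's actual code, using the worst-case input of 128 from this comment (later in the thread refined to 127):

```cpp
#include <cassert>
#include <cstdint>

// Toy version of the canSaturate16 idea (not the PR's code): for each group
// of 4 int8 weights whose products are combined in 16 bits, bound the
// positive and negative weights separately, since any weight may be
// "cancelled" by a zero input. Worst-case input magnitude of 128 assumed,
// as in the comment above; the thread later refines this to 127.
bool groupCanSaturate16(const std::int8_t* w) {
    int pos = 0, neg = 0;
    for (int k = 0; k < 4; ++k) {
        if (w[k] > 0) pos += w[k];
        else          neg += w[k];
    }
    // Flag the group unless the worst-case partial sums stay strictly
    // inside (-32768, 32768), matching the formulas above.
    return pos * 128 >= 32768 || neg * 128 <= -32768;
}
```

So a row is safe when every 4-weight group keeps its positive sum below 256 and its negative sum above -256.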

@mstembera mstembera force-pushed the dpbusd2xPR2 branch 2 times, most recently from 0870bd1 to f372aa8 on December 13, 2020 04:51
@mstembera
Contributor Author

AVX512 is now also implemented.

@ddobbelaere
Contributor

In #3222 you mention a max input value of 127, but here you allude to 128. Is this to be on the safe side w.r.t. the "can saturate" bounds?

If 127 actually is the max value, I think you can safely check whether the sum of the positive weights is smaller than or equal to 256, and mutatis mutandis for the negative weights.

@mstembera
Contributor Author

@ddobbelaere Thanks for reviewing. You are correct; I should have reread my old PR. Since 127 * 258 < 2^15 - 1, the sum of the weights can be as high as 258. I have updated the PR.

@ddobbelaere
Contributor
ddobbelaere commented Dec 13, 2020

Cool, you are right about 258. Maybe this is a bit out of scope for the current PR, but IMHO we ideally want to static-assert the bounds, so that if the 127 clamp value of the ReLU layer in https://github.com/official-stockfish/Stockfish/blob/0d91d41c3a19c1fa192afbd9c98e141bade0626b/src/nnue/layers/clipped_relu.h#L155 changes, we don't get into trouble without being aware.

If we want to do this, I'm not so sure what the best approach would be, though. Maybe extend the layer classes with static constexpr "max/min output" functions?
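A minimal sketch of that idea, with the caveat that the class and constant names here are illustrative and not Stockfish's actual ones:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical shape of the suggestion: the activation layer exposes its
// maximum output as a constant, and downstream code derives the largest
// 4-weight sum that cannot overflow int16 and checks it at compile time.
struct ClippedReLU {
    static constexpr int kMaxOutput = 127;  // the clamp value
};

constexpr int kMaxWeightSum =
    std::numeric_limits<std::int16_t>::max() / ClippedReLU::kMaxOutput;

// 127 * 258 = 32766 <= 32767, so 258 is the bound discussed above.
static_assert(kMaxWeightSum == 258,
              "re-derive the saturation bound if the clamp value changes");
```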

@syzygy1
Contributor
syzygy1 commented Dec 13, 2020

I haven't yet tried to understand the details, but could it help to permute the weights to avoid cases of potential saturation?

@syzygy1
Contributor
syzygy1 commented Dec 13, 2020

If canSaturate16[] is only accessed four bools at a time using a cast to (uint32_t *), it may be nicer to merge those four bools into a single bool. (This would also avoid theoretical aliasing problems.)
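One aliasing-safe way to do the four-at-a-time read — a sketch assuming sizeof(bool) == 1, not the code the PR ended up with:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(bool) == 1, "the 4-byte packing below assumes 1-byte bool");

// Instead of casting &canSaturate16[i] to (uint32_t*), which has the
// theoretical aliasing problem mentioned above, memcpy the four adjacent
// flags into a uint32_t; compilers turn this into a single 32-bit load.
bool anyOfFour(const bool* canSaturate16, std::size_t i) {
    std::uint32_t packed;
    std::memcpy(&packed, canSaturate16 + i, sizeof packed);
    return packed != 0;
}
```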

@vondele
Member
vondele commented Dec 13, 2020

I haven't yet tried to understand the details, but could it help to permute the weights to avoid cases of potential saturation?

With the current net, it seems canSaturate16 is never true.
Edit: with the net designed to trigger saturation, it is true in 20 out of 130 cases.

@mstembera
Contributor Author

@ddobbelaere
It's a reasonable suggestion, but I agree it's probably outside the scope of this PR. The current implementations in master also depend on input clamping, since the _mm256_maddubs_epi16() instructions already sum two adjacent products in 16 bits, which without input clamping could saturate.
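A scalar model of that pairwise step may help: _mm256_maddubs_epi16 multiplies unsigned 8-bit inputs by signed 8-bit weights and adds adjacent product pairs with signed saturation to 16 bits. The toy function below (an illustration, not the SIMD code itself) shows why the 127 clamp matters:

```cpp
#include <cassert>
#include <cstdint>

// One output lane of the maddubs step: u8 * i8 products for an adjacent
// pair, summed with signed saturation into int16.
std::int16_t maddubsPair(std::uint8_t a0, std::int8_t b0,
                         std::uint8_t a1, std::int8_t b1) {
    int sum = int(a0) * b0 + int(a1) * b1;
    if (sum >  32767) sum =  32767;   // saturate high
    if (sum < -32768) sum = -32768;   // saturate low
    return std::int16_t(sum);
}
```

With inputs clamped to 127, the worst pair is 127*127 + 127*127 = 32258, which fits in int16; with full-range inputs up to 255, 255*127 + 255*127 = 64770 would already saturate.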

@syzygy1
canSaturate16 isn't really necessary; it only serves to alleviate theoretical concerns with the original PR. As @vondele points out, for normal nets it's always false. If future nets start saturating, I agree permuting the weights would be worth exploring.

I chose to calculate canSaturate16 as one bool per weight row since that's the most straightforward. Processing 4 rows at a time seems like a detail of the current Propagate() implementation, so I didn't want to add a dependence on that. However, see the updated version.

@vondele
I pushed a cleaner alternative to the "if (!*(const uint32_t*)&canSaturate16[i])" lines.

@vondele vondele added the to be merged Will be merged shortly label Dec 14, 2020
@vondele vondele closed this in d862ba4 Dec 14, 2020
}

*outptr = m128_haddx4(sum0, sum1, sum2, sum3, bias);
}
}
else if constexpr (kOutputDimensions == 1)
{
__m128i sum0 = _mm_setzero_si128();
Contributor

Was it necessary to touch this case?

Contributor Author

@MaximMolchanov Technically no, but I didn't want to leave it inconsistent. Saving one _mm_setzero_si128() should not be measurable. I really didn't want to remove the first initializations that were outside of the loops, but I couldn't make it work. See https://tests.stockfishchess.org/tests/view/5fd1ef601ac169120188836a which I tested first.

Contributor

Thanks for sharing this test. Maybe there is no need to initialize with the first value then.

Contributor

I mean even if it passes STC, it's too much code for too small an advantage.

Contributor Author

If it passed STC I would say it's worth it.

Contributor

I've tried non-zero initialization locally under SSSE3 and didn't succeed (checked using pyshbench). To me it is some kind of magic: I get a slowdown even after obvious logical changes and after removing some code.

@syzygy1
Contributor
syzygy1 commented Dec 16, 2020

I noticed that the code (at least in Cfish) is very sensitive to register spilling. This optimisation slows down Cfish-gcc a bit (by introducing spilling for no good reason) and is a big speed up for Cfish-clang (presumably in part by removing spilling for no good reason).

@mstembera
Contributor Author

@syzygy1 I have little experience with register spilling. How do you detect it? Do you suggest any code improvements as a result?

@syzygy1
Contributor
syzygy1 commented Dec 17, 2020

@mstembera
I had a look at the assembly output to understand why Cfish slowed down. One way to compare the amount of register spilling between different versions is to compare the amounts by which the stack pointer is decremented on entry of nnue_evaluate() (assuming most of the code gets inlined, I don't know if that is the case in SF).

I have now fixed the problem by removing the special treatment of the first iteration. I don't know why, but now gcc keeps everything in registers again, and your optimisation is a clear speed up.

I have no suggestions on how to improve the SF code to avoid register spilling, but it is good to be aware that register spilling can be an issue when testing a change that clearly should speed up the NNUE code but doesn't.

Anyway, well done 👍

@mstembera
Contributor Author

@syzygy1
Thanks. I think it may explain why my first test, with the first iterations kept outside of the loops, failed.
https://tests.stockfishchess.org/tests/view/5fd1ef601ac169120188836a
I noticed you moved up the register declaration in UpdateAccumulator() on line 244 with a spill-register comment, which is why I wondered if something similar would be useful for Propagate(). BTW, I like speed optimizations, and it's my understanding that Cfish is quite a bit faster than SF. Is that due more to specific code rewrites or to generally better compiler optimization of C over C++? If you think SF would benefit from porting some things back from Cfish, I'd welcome pointers to any relevant methods.

@syzygy1
Contributor
syzygy1 commented Dec 18, 2020

Indeed, I encountered the same problem when implementing the incremental updating improvement. At first I could not get a speed up at all, but luckily I figured out the reason before giving up on the idea.
Hopefully this problem will be fixed in a future gcc version. Perhaps I will file a bug report, but it will take some time to create a minimal test case.

Regarding Cfish vs Stockfish, I don't think C instead of C++ makes a big difference.
  • Cfish's Position struct includes the Thread data, and Stack includes SF's StateInfo and CheckInfo. The Stack structs are in a linear array instead of a linked (StateInfo) list.
  • Cfish here and there uses narrower types than Stockfish, which should improve caching behaviour.
  • Cfish's search does not place generated moves on the stack but adds them to a global list, which should also be more cache friendly (no need to allocate space for MAX_MOVES moves).
  • In Cfish I generally prefer an array lookup over a calculation, in particular when the array is quite small anyway, but also for BetweenBB[]. When I wrote my TB generator, I found that looping over (the equivalent of) SquareBB[] was faster than starting with bb = 1 and shifting on each iteration (bb <<= 1).
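The lookup-versus-shift remark can be illustrated with a standalone toy (only the names echo the engine's; this is not its code):

```cpp
#include <cassert>
#include <cstdint>

std::uint64_t SquareBB[64];  // precomputed single-bit boards

void initSquareBB() {
    for (int s = 0; s < 64; ++s)
        SquareBB[s] = 1ULL << s;
}

// Table walk: one plain load per square.
std::uint64_t orAllByLookup() {
    std::uint64_t acc = 0;
    for (int s = 0; s < 64; ++s)
        acc |= SquareBB[s];
    return acc;
}

// Shift walk: recompute the bit with bb <<= 1, the variant reported slower.
std::uint64_t orAllByShift() {
    std::uint64_t acc = 0, bb = 1;
    for (int s = 0; s < 64; ++s, bb <<= 1)
        acc |= bb;
    return acc;
}
```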

Some of the optimisations in Cfish are undoubtedly too dirty for Stockfish, e.g. updating the rule50 and pliesFromNull counters by combining them into a single uint16_t and adding 0x0101.
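A toy version of that counter trick (not Cfish's actual code; the member names are just for illustration):

```cpp
#include <cassert>
#include <cstdint>

// rule50 and pliesFromNull share one uint16_t; a single add of 0x0101
// bumps both. Valid as long as neither counter reaches 256, which holds
// for these two in practice.
struct PackedCounters {
    std::uint16_t v = 0;  // low byte: rule50, high byte: pliesFromNull

    void bump()                { v += 0x0101; }  // advance both counters
    void resetRule50()         { v &= 0xFF00; }  // e.g. after a capture
    int  rule50() const        { return v & 0xFF; }
    int  pliesFromNull() const { return v >> 8; }
};
```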

Cfish has two NNUE implementations: nnue-regular.c (similar to SF's) and nnue-sparse.c. (Compile option sparse=yes/no.)
nnue-sparse.c is a big win for SSE2 and MMX (and also for at least certain ARM devices). It also seems to be a win for all AMD cpus before Zen 3. (Zen 3 seems to have an improved AVX2 implementation which favours nnue-regular.c.) For SSSE3 and SSE41 the two seem very close (depending on the cpu). nnue-regular is faster on Intel AVX2 and up.

Implementing nnue-sparse.c in SF is not straightforward.
nnue-regular.c has optimisations for AVX2 and AVX512. Most of the permutations in SF that compensate for the per-lane behaviour of AVX2 and AVX512 can be removed by setting up the weights matrix in the right way.

@MaximMolchanov
Contributor
MaximMolchanov commented Dec 18, 2020

What do you think about next ideas?

  • Use even more intermediate 16-bit calculations (bench is unchanged even if the entire row is calculated in 16 bits).
  • Return false in ReadParameters if at least one saturation is possible. We would reject some specific nets, but IMHO such a net will never exist. As a result, there would be no annoying (and useless) checks in ~6 places in the code (don't forget that current master doesn't have a useful saturation check in the kOutputDimensions == 1 case).

@mstembera
Contributor Author

@syzygy1
Thanks for detailing all the various enhancements! Should be plenty to investigate. I hope other devs have a read as well.

@MaximMolchanov
I already tried using 3 and 4 intermediates but they both failed.
https://tests.stockfishchess.org/tests/view/5fd9a81c0c58709243620000
https://tests.stockfishchess.org/tests/view/5fd8299c0c5870924361ff25
I wonder if the register spilling discussed above by @syzygy1 may play a role.

As far as not loading nets that saturate, I'm not sure how likely such nets are to be generated in the future. Anyway, that would not be my decision.

joergoster pushed a commit to joergoster/Stockfish-old that referenced this pull request Dec 19, 2020
Improves throughput by summing 2 intermediate dot products using 16-bit addition before upconverting to 32 bits.

Potential saturation is detected and the code-path is avoided in this case.
The saturation can't happen with the current nets,
but nets can be constructed that trigger this check.

STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106

about 5% speedup

closes official-stockfish/Stockfish#3261

No functional change
@mstembera
Contributor Author

To anyone curious, I just tested a version that uses 16 bits for entire rows. The AVX2 compile takes the fast path in all cases. However, it surprisingly failed badly https://tests.stockfishchess.org/tests/view/5fe328533932f79192d3963a and I'm not sure why.
My local bench is about the same speed as master.

@MaximMolchanov
Contributor

I also tested 4 intermediate results: https://tests.stockfishchess.org/tests/view/5fe013d93932f79192d394d0
It differs from your test because there are also 2 intermediate results left.
My local bench shows a confident speedup.
Also, non-zero init gave some speedup now (in the new master).
I'm also surprised about these failures and the dependence on unrelated changes in master.

@MaximMolchanov
Contributor
MaximMolchanov commented Dec 23, 2020

@mstembera Here are my bench results with your last test (base is master, test is yours):

run       base       test     diff
  1    1704971    1637117   -67854
  2    1700486    1667278   -33208
  3    1699847    1633571   -66276
  4    1653872    1618381   -35491
  5    1635934    1610891   -25043
  6    1692852    1650855   -41997
  7    1712715    1678410   -34305
  8    1705614    1682779   -22835
  9    1708189    1660548   -47641
 10    1699847    1659939   -39908

Result of  10 runs
==================
base (...rc/stockfish) =    1691433  +/- 15758
test (...rc/stockfish) =    1649977  +/- 15098
diff                   =     -41456  +/- 9509

speedup        = -0.0245
P(speedup > 0) =  0.0000

CPU: 4 x Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Hyperthreading: on 

@mstembera
Contributor Author

@MaximMolchanov Thanks for testing. I measured your patch as well and got the results below, although my local benches seem to have a lot of variance lately.

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1168069   1178170   -10101    
    StDev   99011     93977     150125    

p-value: 0.527
speedup: 0.009

Labels
to be merged Will be merged shortly