8000 Fixed AVX2 and SSSE3 speedup by mstembera · Pull Request #3261 · official-stockfish/Stockfish · GitHub

Fixed AVX2 and SSSE3 speedup #3261


Closed
wants to merge 1 commit into from

Conversation

mstembera
Contributor

STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106

This is a fixed version of #3222 which guarantees no saturation.
(Combining this with moving the first iteration outside of the loop is worse.)

No functional change
bench: 4278746

@vondele
Member
vondele commented Dec 12, 2020

I locally measure about a 5% speedup, and the bench is correct on the specially constructed net that previously triggered saturation (https://github.com/official-stockfish/Stockfish/files/5510407/nn-new.nnue.gz), so that looks nice.

Maybe you can describe the idea behind the canSaturate16 calculation?

I'd appreciate some additional review by those knowledgeable of this part of the code.

@mstembera
Contributor Author
mstembera commented Dec 12, 2020

@vondele Thanks for testing!
There are always 4 weight * input products that get combined in 16 bits before we upconvert to 32 bits. They are always the 1st and 2nd weights of the first vector together with the 1st and 2nd weights of the second vector, then the 3rd and 4th weights of the first vector with the 3rd and 4th weights of the second vector, and so on. The AVX2 vectors are 32 weights wide and the SSSE3 vectors 16 weights wide. Since the inputs are not known at the time of the calculation, we have to assume the worst case of every input being 128, with positive and negative weights never canceling out, because any canceling input could be 0. The formulas are then (w0 + w1 + w2 + w3) * 128 < 32768 for positive weights and (w0 + w1 + w2 + w3) * 128 > -32768 for negative weights. So the sum of the positive weights has to be less than 256 and the sum of the negative weights greater than -256.
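The per-group check described above can be sketched as follows — a standalone illustration rather than the PR's actual code, using the worst-case input of 128 from this comment (later in the thread refined to 127):

```cpp
#include <cassert>
#include <cstdint>

// Toy version of the canSaturate16 idea (not the PR's code): for each group
// of 4 int8 weights whose products are combined in 16 bits, bound the
// positive and negative weights separately, since any weight may be
// "cancelled" by a zero input. Worst-case input magnitude of 128 assumed,
// as in the comment above; the thread later refines this to 127.
bool groupCanSaturate16(const std::int8_t* w) {
    int pos = 0, neg = 0;
    for (int k = 0; k < 4; ++k) {
        if (w[k] > 0) pos += w[k];
        else          neg += w[k];
    }
    // Flag the group unless the worst-case partial sums stay strictly
    // inside (-32768, 32768), matching the formulas above.
    return pos * 128 >= 32768 || neg * 128 <= -32768;
}
```

So a row is safe when every 4-weight group keeps its positive sum below 256 and its negative sum above -256.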

@mstembera mstembera force-pushed the dpbusd2xPR2 branch 2 times, most recently from 0870bd1 to f372aa8 on December 13, 2020 04:51
@mstembera
Contributor Author

AVX512 is now also implemented.

@ddobbelaere
Contributor

In #3222 you mention a max input value of 127, but here you allude to 128. Is this to be on the safe side w.r.t. the "can saturate" bounds?

If 127 actually is the max value, I think you can safely check whether the sum of the positive weights is smaller than or equal to 256, and mutatis mutandis for the negative weights.

@mstembera
Contributor Author

@ddobbelaere Thanks for reviewing. You are correct; I should have reread my old PR. Since 127 * 258 < 2^15 - 1, the sum of the weights can be as high as 258. I have updated the PR.

@ddobbelaere
Contributor
ddobbelaere commented Dec 13, 2020

Cool, you are right about 258. Maybe this is a bit out of scope for the current PR, but IMHO we ideally want to static-assert the bounds, so that if the 127 clamp value of the ReLU layer in https://github.com/official-stockfish/Stockfish/blob/0d91d41c3a19c1fa192afbd9c98e141bade0626b/src/nnue/layers/clipped_relu.h#L155 changes, we don't get into trouble without being aware.

If we want to do this, I'm not so sure what the best approach would be, though. Maybe extend the layer classes with static constexpr "max/min output" functions?
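A minimal sketch of that idea, with the caveat that the class and constant names here are illustrative and not Stockfish's actual ones:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical shape of the suggestion: the activation layer exposes its
// maximum output as a constant, and downstream code derives the largest
// 4-weight sum that cannot overflow int16 and checks it at compile time.
struct ClippedReLU {
    static constexpr int kMaxOutput = 127;  // the clamp value
};

constexpr int kMaxWeightSum =
    std::numeric_limits<std::int16_t>::max() / ClippedReLU::kMaxOutput;

// 127 * 258 = 32766 <= 32767, so 258 is the bound discussed above.
static_assert(kMaxWeightSum == 258,
              "re-derive the saturation bound if the clamp value changes");
```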

@syzygy1
Contributor
syzygy1 commented Dec 13, 2020

I haven't yet tried to understand the details, but could it help to permute the weights to avoid cases of potential saturation?

@syzygy1
Contributor
syzygy1 commented Dec 13, 2020

If canSaturate16[] is only accessed four bools at a time using a cast to (uint32_t *), it may be nicer to merge those four bools into a single bool. (This would also avoid theoretical aliasing problems.)
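One aliasing-safe way to do the four-at-a-time read — a sketch assuming sizeof(bool) == 1, not the code the PR ended up with:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(bool) == 1, "the 4-byte packing below assumes 1-byte bool");

// Instead of casting &canSaturate16[i] to (uint32_t*), which has the
// theoretical aliasing problem mentioned above, memcpy the four adjacent
// flags into a uint32_t; compilers turn this into a single 32-bit load.
bool anyOfFour(const bool* canSaturate16, std::size_t i) {
    std::uint32_t packed;
    std::memcpy(&packed, canSaturate16 + i, sizeof packed);
    return packed != 0;
}
```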

@vondele
Member
vondele commented Dec 13, 2020

I haven't yet tried to understand the details, but could it help to permute the weights to avoid cases of potential saturation?

With the current net, it seems canSaturate16 is never true.
Edit: with the net designed to trigger saturation, it is true in 20 out of 130 cases.

@mstembera
Contributor Author

@ddobbelaere
It's a reasonable suggestion, but I agree it's probably outside the scope of this PR. The current implementations in master also depend on input clamping, since the _mm256_maddubs_epi16() instructions already sum two adjacent products in 16 bits, which without input clamping could saturate.
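A scalar model of that pairwise step may help: _mm256_maddubs_epi16 multiplies unsigned 8-bit inputs by signed 8-bit weights and adds adjacent product pairs with signed saturation to 16 bits. The toy function below (an illustration, not the SIMD code itself) shows why the 127 clamp matters:

```cpp
#include <cassert>
#include <cstdint>

// One output lane of the maddubs step: u8 * i8 products for an adjacent
// pair, summed with signed saturation into int16.
std::int16_t maddubsPair(std::uint8_t a0, std::int8_t b0,
                         std::uint8_t a1, std::int8_t b1) {
    int sum = int(a0) * b0 + int(a1) * b1;
    if (sum >  32767) sum =  32767;   // saturate high
    if (sum < -32768) sum = -32768;   // saturate low
    return std::int16_t(sum);
}
```

With inputs clamped to 127, the worst pair is 127*127 + 127*127 = 32258, which fits in int16; with full-range inputs up to 255, 255*127 + 255*127 = 64770 would already saturate.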

@syzygy1
canSaturate16 isn't really necessary; it only serves to alleviate theoretical concerns with the original PR. As @vondele points out, for normal nets it's always false. If future nets start saturating, I agree permuting the weights would be worth exploring.

I chose to calculate canSaturate16 as one bool per weight row since that's the most straightforward. Processing 4 rows at a time seems like a detail of the current Propagate() implementation, so I didn't want to add a dependence on that. However, see the updated version.

@vondele
I pushed a cleaner alternative to the "if (!*(const uint32_t*)&canSaturate16[i])" lines.

@vondele vondele added the to be merged Will be merged shortly label Dec 14, 2020
@vondele vondele closed this in d862ba4 Dec 14, 2020
}

*outptr = m128_haddx4(sum0, sum1, sum2, sum3, bias);
}
}
else if constexpr (kOutputDimensions == 1)
{
__m128i sum0 = _mm_setzero_si128();
Contributor

Was it necessary to touch this case?

Contributor Author

@MaximMolchanov Technically no, but I didn't want to leave it inconsistent. Saving one _mm_setzero_si128() should not be measurable. I really didn't want to remove the first initializations that were outside of the loops, but I couldn't make it work. See https://tests.stockfishchess.org/tests/view/5fd1ef601ac169120188836a which I tested first.

Contributor

Thanks for sharing this test. Maybe there is no need to initialize with the first value then.

Contributor

I mean even if it passes STC, it's too much code for too small an advantage.

Contributor Author

If it passed STC I would say it's worth it.

Contributor

I've tried non-zero initialization locally under SSSE3 and didn't succeed (checked using pyshbench). To me it is some kind of magic: I get a slowdown even after obvious logical changes and after removing some code.

@syzygy1
Contributor
syzygy1 commented Dec 16, 2020

I noticed that the code (at least in Cfish) is very sensitive to register spilling. This optimisation slows down Cfish-gcc a bit (by introducing spilling for no good reason) and is a big speed up for Cfish-clang (presumably in part by removing spilling for no good reason).

@mstembera
Contributor Author

@syzygy1 I have little experience with register spilling. How do you detect it? Do you suggest any code improvements as a result?

@syzygy1
Contributor
syzygy1 commented Dec 17, 2020

@mstembera
I had a look at the assembly output to understand why Cfish slowed down. One way to compare the amount of register spilling between different versions is to compare the amounts by which the stack pointer is decremented on entry of nnue_evaluate() (assuming most of the code gets inlined, I don't know if that is the case in SF).

I have now fixed the problem by removing the special treatment of the first iteration. I don't know why, but now gcc keeps everything in registers again, and your optimisation is a clear speed up.

I have no suggestions on how to improve the SF code to avoid register spilling, but it is good to be aware that register spilling can be an issue when testing a change that clearly should speed up the NNUE code but doesn't.

Anyway, well done 👍

@mstembera
Contributor Author

@syzygy1
Thanks. I think it may explain why my first test, with the first iterations kept outside of the loops, failed.
https://tests.stockfishchess.org/tests/view/5fd1ef601ac169120188836a
I noticed you moved up the register declaration in UpdateAccumulator() on line 244 with a spill-register comment, which is why I wondered if something similar would be useful for Propagate(). BTW, I like speed optimizations, and it's my understanding that Cfish is quite a bit faster than SF. Is that due more to specific code rewrites or to generally better compiler optimization of C over C++? If you think SF would benefit from porting some things back from Cfish, I'd welcome pointers to any relevant methods.

@syzygy1
Contributor
syzygy1 commented Dec 18, 2020

Indeed, I encountered the same problem when implementing the incremental updating improvement. At first I could not get a speed up at all, but luckily I figured out the reason before giving up on the idea.
Hopefully this problem will be fixed in a future gcc version. Perhaps I will file a bug report, but it will take some time to create a minimal test case.

Regarding Cfish vs Stockfish, I don't think C instead of C++ makes a big difference.
  • Cfish's Position struct includes the Thread data, and Stack includes SF's StateInfo and CheckInfo. The Stack structs are in a linear array instead of a linked (StateInfo) list.
  • Cfish here and there uses narrower types than Stockfish, which should improve caching behaviour.
  • Cfish's search does not place generated moves on the stack but adds them to a global list, which should also be more cache friendly (no need to allocate space for MAX_MOVES moves).
  • In Cfish I generally prefer an array lookup over a calculation, in particular when the array is quite small anyway, but also for BetweenBB[]. When I wrote my TB generator, I found that looping over (the equivalent of) SquareBB[] was faster than starting with bb = 1 and shifting on each iteration (bb <<= 1).
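The lookup-versus-shift remark can be illustrated with a standalone toy (only the names echo the engine's; this is not its code):

```cpp
#include <cassert>
#include <cstdint>

std::uint64_t SquareBB[64];  // precomputed single-bit boards

void initSquareBB() {
    for (int s = 0; s < 64; ++s)
        SquareBB[s] = 1ULL << s;
}

// Table walk: one plain load per square.
std::uint64_t orAllByLookup() {
    std::uint64_t acc = 0;
    for (int s = 0; s < 64; ++s)
        acc |= SquareBB[s];
    return acc;
}

// Shift walk: recompute the bit with bb <<= 1, the variant reported slower.
std::uint64_t orAllByShift() {
    std::uint64_t acc = 0, bb = 1;
    for (int s = 0; s < 64; ++s, bb <<= 1)
        acc |= bb;
    return acc;
}
```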

Some of the optimisations in Cfish are undoubtedly too dirty for Stockfish, e.g. updating the rule50 and pliesFromNull counters by combining them into a single uint16_t and adding 0x0101.
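A toy version of that counter trick (not Cfish's actual code; the member names are just for illustration):

```cpp
#include <cassert>
#include <cstdint>

// rule50 and pliesFromNull share one uint16_t; a single add of 0x0101
// bumps both. Valid as long as neither counter reaches 256, which holds
// for these two in practice.
struct PackedCounters {
    std::uint16_t v = 0;  // low byte: rule50, high byte: pliesFromNull

    void bump()                { v += 0x0101; }  // advance both counters
    void resetRule50()         { v &= 0xFF00; }  // e.g. after a capture
    int  rule50() const        { return v & 0xFF; }
    int  pliesFromNull() const { return v >> 8; }
};
```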

Cfish has two NNUE implementations: nnue-regular.c (similar to SF's) and nnue-sparse.c. (Compile option sparse=yes/no.)
nnue-sparse.c is a big win for SSE2 and MMX (and also for at least certain ARM devices). It also seems to be a win for all AMD cpus before Zen 3. (Zen 3 seems to have an improved AVX2 implementation which favours nnue-regular.c.) For SSSE3 and SSE41 the two seem very close (depending on the cpu). nnue-regular is faster on Intel AVX2 and up.

Implementing nnue-sparse.c in SF is not straightforward.
nnue-regular.c has optimisations for AVX2 and AVX512. Most of the permutations in SF that compensate for the per-lane behaviour of AVX2 and AVX512 can be removed by setting up the weights matrix in the right way.

@MaximMolchanov
Contributor
MaximMolchanov commented Dec 18, 2020

What do you think about next ideas?

  • Use even more intermediate 16-bit calculations (bench is unchanged even if the entire row is calculated in 16 bits).
  • Return false in ReadParameters if at least one saturation is possible. We would reject some specific nets, but IMHO such a net will never exist. As a result, there would be no annoying (and useless) checks in ~6 places in the code (don't forget that current master doesn't have a useful saturation check in the kOutputDimensions == 1 case).

@mstembera
Contributor Author

@syzygy1
Thanks for detailing all the various enhancements! Should be plenty to investigate. I hope other devs have a read as well.

@MaximMolchanov
I already tried using 3 and 4 intermediates but they both failed.
https://tests.stockfishchess.org/tests/view/5fd9a81c0c58709243620000
https://tests.stockfishchess.org/tests/view/5fd8299c0c5870924361ff25
I wonder if the register spilling discussed above by @syzygy1 may play a role.

As far as not loading nets that saturate, I'm not sure how likely such nets are to be generated in the future. Anyway, that would not be my decision.

joergoster pushed a commit to joergoster/Stockfish-old that referenced this pull request Dec 19, 2020
Improves throughput by summing 2 intermediate dot products using 16-bit addition before upconverting to 32 bits.

Potential saturation is detected and the code-path is avoided in this case.
The saturation can't happen with the current nets,
but nets can be constructed that trigger this check.

STC https://tests.stockfishchess.org/tests/view/5fd40a861ac1691201888479
LLR: 2.94 (-2.94,2.94) {-0.25,1.25}
Total: 25544 W: 2451 L: 2296 D: 20797
Ptnml(0-2): 92, 1761, 8925, 1888, 106

about 5% speedup

closes official-stockfish/Stockfish#3261

No functional change
@mstembera
Contributor Author

To anyone curious, I just tested a version that uses 16 bits for entire rows. The AVX2 compile takes the fast path in all cases. However, it surprisingly failed badly https://tests.stockfishchess.org/tests/view/5fe328533932f79192d3963a and I'm not sure why.
My local bench is about the same speed as master.

@MaximMolchanov
Contributor

I also tested 4 intermediate results: https://tests.stockfishchess.org/tests/view/5fe013d93932f79192d394d0
It differs from your test because there are also 2 intermediate results left.
My local bench shows a confident speedup.
Also, non-zero init gave some speedup now (in the new master).
I'm also surprised about these failures and the dependence on unrelated changes in master.

@MaximMolchanov
Contributor
MaximMolchanov commented Dec 23, 2020

@mstembera Here are my bench results with your last test (base is master, test is yours):

run       base       test     diff
  1    1704971    1637117   -67854
  2    1700486    1667278   -33208
  3    1699847    1633571   -66276
  4    1653872    1618381   -35491
  5    1635934    1610891   -25043
  6    1692852    1650855   -41997
  7    1712715    1678410   -34305
  8    1705614    1682779   -22835
  9    1708189    1660548   -47641
 10    1699847    1659939   -39908

Result of  10 runs
==================
base (...rc/stockfish) =    1691433  +/- 15758
test (...rc/stockfish) =    1649977  +/- 15098
diff                   =     -41456  +/- 9509

speedup        = -0.0245
P(speedup > 0) =  0.0000

CPU: 4 x Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Hyperthreading: on 

@mstembera
Contributor Author

@MaximMolchanov Thanks for testing. I measured your patch as well and got the results below, although my local benches seem to have a lot of variance lately.

Results for 20 tests for each version:

            Base      Test      Diff      
    Mean    1168069   1178170   -10101    
    StDev   99011     93977     150125    

p-value: 0.527
speedup: 0.009

Labels
to be merged Will be merged shortly