8000 Hyperloglog ARM NEON SIMD optimization by xbasel · Pull Request #1859 · valkey-io/valkey · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Hyperloglog ARM NEON SIMD optimization #1859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 1, 2025

Conversation

xbasel
Copy link
Member
@xbasel xbasel commented Mar 18, 2025

Add ARM NEON optimization for HyperLogLog

  • Implement two NEON optmized functions for converting between raw and
    dense representations in HyperLogLog:

    1. hllMergeDenseNEON
    2. hllDenseCompressNEON
      These functions process 16 registers in each iteration.
  • Utilize existing SIMD test in hyperloglog.tcl (previously added for
    AVX2 optimization) to validate NEON implementation

Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}

+-------------------+-----------+-----------+---------------+
|      Metric       |  Before   |   After   | Improvement % |
+-------------------+-----------+-----------+---------------+
| Throughput (k rps)|    7.42   |   76.98   |    937.47%    |
+-------------------+-----------+-----------+---------------+
| Latency (msec)    |           |           |               |
|   avg             |   6.686   |   0.595   |     91.10%    |
|   min             |   0.520   |   0.152   |     70.77%    |
|   p50             |   7.799   |   0.599   |     92.32%    |
|   p95             |   8.039   |   0.767   |     90.46%    |
|   p99             |   8.111   |   0.807   |     90.05%    |
|   max             |   9.263   |   1.463   |     84.21%    |
+-------------------+-----------+-----------+---------------+

Hardware:

CPU: Graviton 3
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-63
Memory: 256 GB

Command stats:
Before:

cmdstat_pfmerge:calls=1000002,usec=126327984,**usec_per_call=126.33**,rejected_calls=0,failed_calls=0

After:

cmdstat_pfmerge:calls=1000002,usec=8588205,**usec_per_call=8.59**,rejected_calls=0,failed_calls=0

Improved by ~14.7x.

Functional testing command:

./runtest --single unit/hyperloglog --only "PFMERGE results with simd"  --loops 10000  --fastfail

The SIMD test randomizes input and comapres scalar vs simd results.

Copy link
codecov bot commented Mar 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.95%. Comparing base (8df0a6b) to head (1fcf2b9).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1859      +/-   ##
============================================
- Coverage     70.96%   70.95%   -0.01%     
============================================
  Files           123      123              
  Lines         66135    66133       -2     
============================================
- Hits          46934    46926       -8     
- Misses        19201    19207       +6     
Files with missing lines Coverage Δ
src/hyperloglog.c 92.20% <100.00%> (-0.03%) ⬇️

... and 13 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xbasel xbasel force-pushed the hll_neon branch 5 times, most recently from d5cc649 to b2c857e Compare March 18, 2025 16:41
@xbasel xbasel self-assigned this Mar 18, 2025
- Implement two NEON optmized functions for converting between raw and
  dense representations in HyperLogLog:
  1. hllMergeDenseNEON
  2. hllDenseCompressNEON
  These functions process 16 registers in each iteration.

- Utilize existing SIMD test in hyperloglog.tcl (previously added for
  AVX2 optimization) to validate NEON implementation

Test:
  valkey-benchmark -n 1000000 --dbnum  9  -p 21111 PFMERGE z hll1{t} hll2{t}

+-------------------+-----------+-----------+---------------+
|      Metric       |  Before   |   After   | Improvement % |
+-------------------+-----------+-----------+---------------+
| Throughput (k rps)|    7.42   |   76.98   |    937.47%    |
+-------------------+-----------+-----------+---------------+
| Latency (msec)    |           |           |               |
|   avg             |   6.686   |   0.595   |     91.10%    |
|   min             |   0.520   |   0.152   |     70.77%    |
|   p50             |   7.799   |   0.599   |     92.32%    |
|   p95             |   8.039   |   0.767   |     90.46%    |
|   p99             |   8.111   |   0.807   |     90.05%    |
|   max             |   9.263   |   1.463   |     84.21%    |
+-------------------+-----------+-----------+---------------+

Hardware:
CPU: Graviton 3
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-63
Memory: 256 GB

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
@xbasel xbasel marked this pull request as ready for review March 18, 2025 16:56
Copy link
Contributor
@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 times faster is a pretty good improvement. :)

I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

@xbasel
Copy link
Member Author
xbasel commented Mar 21, 2025

10 times faster is a pretty good improvement. :)

I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

It is similar. NEON vectors are 128 bit, AVX2 is 256 bit. The padding and lookup is a bit different in AVX2.
The execution time of pfmerge ~14.7x faster. The end to end is ~10x faster.

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
@xbasel xbasel requested a review from zuiderkwast April 10, 2025 12:58
Comment on lines 234 to 240
#ifdef __ARM_NEON
static int simd_enabled = 1;
#define HLL_USE_NEON (simd_enabled)
#else
#define HLL_USE_NEON 0
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need add a runtime check here when server startup ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've limited this to aarch64, which is guaranteed to have neon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is guaranteed to have NEON support in the compile environment. Is it possible to run AArch64 binaries on a non-neon platform?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, all AArch64 include NEON. AArch64 was new in ARMv8. https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8. (Even older ARM 32-bit have NEON but we don't care about those for SIMD.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.

Copy link
Member Author
@xbasel xbasel May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.

https://developer.arm.com/documentation/102474/0100/Fundamentals-of-Armv8-Neon-technology

AArch64 is the name used to describe the 64-bit Execution state of the Armv8-A architecture. In AArch64 state, the processor executes the A64 instruction set, which contains Neon instructions (also referred to as SIMD instructions). GNU and Linux documentation sometimes refers to AArch64 as ARM64.

Older architectures do not support AArch64 execution state, so the binary won't run.

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
@xbasel xbasel requested a review from lipzhu April 29, 2025 14:17
@xbasel xbasel marked this pull request as draft April 29, 2025 16:17
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
@xbasel xbasel requested a review from zuiderkwast April 29, 2025 18:57
@xbasel xbasel marked this pull request as ready for review April 29, 2025 18:58
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
@xbasel xbasel requested a review from zuiderkwast April 30, 2025 08:56
Copy link
Contributor
@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I didn't review the actual SIMD code very carefully, but we have a test to compare the results with/without SIMD so I think it's safe.

I'll wait for @lipzhu's approval before merge.

@zuiderkwast zuiderkwast added the release-notes This issue should get a line item in the release notes label Apr 30, 2025
@xbasel
Copy link
Member Author
xbasel commented Apr 30, 2025

LGTM.

I didn't review the actual SIMD code very carefully, but we have a test to compare the results with/without SIMD so I think it's safe.

I'll wait for @lipzhu's approval before merge.

The code is actually already being tested with SIMD vs Scalar in tests/unit/hyperloglog.tcl. See PFMERGE results with simd test.
All you need to do is to run the test on ARM.

➜  valkey git:(hll_neon) ✗ ./runtest --single unit/hyperloglog --only "PFMERGE results with simd"
Cleanup: may take some time... OK
..
[ok]: PFMERGE results with simd (457 ms)
..
[1/1 done]: unit/hyperloglog (1 seconds)

                   The End

Execution time of different units:
  1 seconds - unit/hyperloglog

\o/ All tests passed without errors!

Cleanup: may take some time... OK
➜  valkey git:(hll_neon) ✗ cat /proc/cpuinfo
processor	: 0
BogoMIPS	: 2100.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer	: 0x41
CPU architecture: 8

This is an ARMv8-A 64-bit CPU (and supports NEON).

@zuiderkwast
Copy link
Contributor

The code is actually already being tested with SIMD vs Scalar in tests/unit/hyperloglog.tcl. See PFMERGE results with simd test.
All you need to do is to run the test on ARM.

Yeah, I know. That's why I said "we have a test to compare the results with/without SIMD so I think it's safe".

@zuiderkwast zuiderkwast merged commit dd772c4 into valkey-io:unstable May 1, 2025
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0