Hyperloglog ARM NEON SIMD optimization #1859

xbasel · 2025-03-18T11:50:31Z

Add ARM NEON optimization for HyperLogLog

Implement two NEON optimized functions for converting between raw and
dense representations in HyperLogLog:
1. hllMergeDenseNEON
2. hllDenseCompressNEON
  These functions process 16 registers in each iteration.
Utilize existing SIMD test in hyperloglog.tcl (previously added for
AVX2 optimization) to validate NEON implementation
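
As a rough illustration of what "merge dense into raw, 16 registers per iteration" means, here is a minimal sketch. It is not the code from this PR: `unpack16` and `merge_dense_into_raw` are hypothetical names, and the layout is assumed to be 6-bit registers packed least significant bit first, so 16 registers occupy 12 bytes. The NEON path only vectorizes the max step; the unpacking stays scalar for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Unpack 16 six-bit registers from 12 packed bytes.
 * Assumed layout: register i starts at bit 6*i, LSB first. */
static void unpack16(const uint8_t *src, uint8_t out[16]) {
    for (int g = 0; g < 4; g++) {
        const uint8_t *p = src + 3 * g; /* 3 bytes hold 4 registers */
        out[4 * g + 0] = p[0] & 0x3f;
        out[4 * g + 1] = ((p[0] >> 6) | (p[1] << 2)) & 0x3f;
        out[4 * g + 2] = ((p[1] >> 4) | (p[2] << 4)) & 0x3f;
        out[4 * g + 3] = p[2] >> 2;
    }
}

/* Merge a packed dense array into a raw one-byte-per-register array,
 * keeping the per-register maximum. nregs must be a multiple of 16. */
static void merge_dense_into_raw(uint8_t *raw, const uint8_t *dense, size_t nregs) {
    assert(nregs % 16 == 0);
    for (size_t i = 0; i < nregs; i += 16) {
        uint8_t tmp[16];
        unpack16(dense + (i / 16) * 12, tmp);
#if defined(__aarch64__)
        /* One 128-bit vector holds exactly 16 one-byte registers. */
        vst1q_u8(raw + i, vmaxq_u8(vld1q_u8(raw + i), vld1q_u8(tmp)));
#else
        for (int j = 0; j < 16; j++)
            if (tmp[j] > raw[i + j]) raw[i + j] = tmp[j];
#endif
    }
}
```

Calling `merge_dense_into_raw(raw, dense, 16)` with `dense[0] = 0x41` raises registers 0 and 1 of `raw` to at least 1 and leaves larger existing values untouched.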

Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}

+-------------------+-----------+-----------+---------------+
|      Metric       |  Before   |   After   | Improvement % |
+-------------------+-----------+-----------+---------------+
| Throughput (k rps)|    7.42   |   76.98   |    937.47%    |
+-------------------+-----------+-----------+---------------+
| Latency (msec)    |           |           |               |
|   avg             |   6.686   |   0.595   |     91.10%    |
|   min             |   0.520   |   0.152   |     70.77%    |
|   p50             |   7.799   |   0.599   |     92.32%    |
|   p95             |   8.039   |   0.767   |     90.46%    |
|   p99             |   8.111   |   0.807   |     90.05%    |
|   max             |   9.263   |   1.463   |     84.21%    |
+-------------------+-----------+-----------+---------------+

Hardware:

CPU: Graviton 3
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-63
Memory: 256 GB

Command stats:
Before:

cmdstat_pfmerge:calls=1000002,usec=126327984,**usec_per_call=126.33**,rejected_calls=0,failed_calls=0

After:

cmdstat_pfmerge:calls=1000002,usec=8588205,**usec_per_call=8.59**,rejected_calls=0,failed_calls=0

Improved by ~14.7x.
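
For readers checking the numbers, the sketch below shows how the figures above appear to be derived, assuming the throughput column uses relative increase and the latency rows use relative decrease. The helper names are illustrative only.

```c
#include <assert.h>

/* Ratio of old cost to new cost, e.g. usec_per_call before / after. */
static double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}

/* Relative increase in percent (used here for throughput). */
static double pct_increase(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Relative decrease in percent (used here for latency). */
static double pct_decrease(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

With the reported values, `speedup(126.33, 8.59)` gives roughly 14.7, `pct_increase(7.42, 76.98)` roughly 937.47, and `pct_decrease(6.686, 0.595)` roughly 91.10.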

Functional testing command:

./runtest --single unit/hyperloglog --only "PFMERGE results with simd"  --loops 10000  --fastfail

The SIMD test randomizes input and compares scalar vs SIMD results.
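
The same differential idea can be sketched in C for the compress direction. This is not the Tcl test from the repository; `pack16`, `dense_get` and `roundtrip_matches` are hypothetical helpers, and the 6-bit LSB-first packing is an assumption. A block-wise packer (standing in for the optimized path) is checked against a per-register scalar reader on random input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Block-wise compress: pack 16 six-bit registers into 12 bytes. */
static void pack16(const uint8_t in[16], uint8_t dst[12]) {
    for (int g = 0; g < 4; g++) {
        const uint8_t *r = in + 4 * g;
        uint8_t *p = dst + 3 * g;
        assert(r[0] <= 63 && r[1] <= 63 && r[2] <= 63 && r[3] <= 63);
        p[0] = (uint8_t)(r[0] | (r[1] << 6));
        p[1] = (uint8_t)((r[1] >> 2) | (r[2] << 4));
        p[2] = (uint8_t)((r[2] >> 4) | (r[3] << 2));
    }
}

/* Scalar reference: read one register by bit offset. */
static uint8_t dense_get(const uint8_t *p, size_t regnum) {
    size_t byte = regnum * 6 / 8;
    unsigned fb = (unsigned)(regnum * 6 & 7);
    unsigned b0 = p[byte];
    /* The next byte is only needed when the register straddles it. */
    unsigned b1 = (fb > 2) ? p[byte + 1] : 0;
    return (uint8_t)(((b0 >> fb) | (b1 << (8 - fb))) & 63);
}

/* Returns 1 if every random block reads back unchanged, else 0. */
static int roundtrip_matches(unsigned seed, int rounds) {
    srand(seed);
    for (int n = 0; n < rounds; n++) {
        uint8_t regs[16], packed[12];
        for (int i = 0; i < 16; i++) regs[i] = (uint8_t)(rand() & 63);
        pack16(regs, packed);
        for (size_t i = 0; i < 16; i++)
            if (dense_get(packed, i) != regs[i]) return 0;
    }
    return 1;
}
```

Running many rounds with different seeds plays the same role as `--loops 10000` in the command above: any mismatch between the two paths shows up as a failure.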

codecov · 2025-03-18T12:06:25Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.95%. Comparing base (8df0a6b) to head (1fcf2b9).
Report is 1 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1859      +/-   ##
============================================
- Coverage     70.96%   70.95%   -0.01%     
============================================
  Files           123      123              
  Lines         66135    66133       -2     
============================================
- Hits          46934    46926       -8     
- Misses        19201    19207       +6

Files with missing lines	Coverage Δ
src/hyperloglog.c	`92.20% <100.00%> (-0.03%)`	⬇️

... and 13 files with indirect coverage changes

zuiderkwast

10 times faster is a pretty good improvement. :)

I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

xbasel · 2025-03-21T15:14:12Z

> 10 times faster is a pretty good improvement. :)
>
> I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

It is similar. NEON vectors are 128 bit, AVX2 is 256 bit. The padding and lookup is a bit different in AVX2.
The execution time of PFMERGE is ~14.7x faster. End to end, it is ~10x faster.

lipzhu · 2025-04-14T08:46:16Z

src/hyperloglog.c

+#ifdef __ARM_NEON
+static int simd_enabled = 1;
+#define HLL_USE_NEON (simd_enabled)
+#else
+#define HLL_USE_NEON 0
+#endif
+


Do we need to add a runtime check here at server startup?

I've limited this to aarch64, which is guaranteed to have neon.

It is guaranteed to have NEON support in the compile environment. Is it possible to run AArch64 binaries on a non-neon platform?

IIUC, all AArch64 include NEON. AArch64 was new in ARMv8. https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8. (Even older ARM 32-bit have NEON but we don't care about those for SIMD.)

Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.

> Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.

https://developer.arm.com/documentation/102474/0100/Fundamentals-of-Armv8-Neon-technology

> AArch64 is the name used to describe the 64-bit Execution state of the Armv8-A architecture. In AArch64 state, the processor executes the A64 instruction set, which contains Neon instructions (also referred to as SIMD instructions). GNU and Linux documentation sometimes refers to AArch64 as ARM64.

Older architectures do not support AArch64 execution state, so the binary won't run.
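
For reference, had a runtime check been wanted, a Linux-only version could look roughly like the sketch below. It is not part of this PR: `neon_available` is a hypothetical name, and the fallback definition of `HWCAP_ASIMD` assumes the arm64 Linux value. On AArch64 the check is expected to always succeed, which is the argument for skipping it.

```c
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_ASIMD
#define HWCAP_ASIMD (1 << 1) /* assumed arm64 Linux value */
#endif
#endif

/* Returns 1 if Advanced SIMD (NEON) can be used, 0 otherwise. */
static int neon_available(void) {
#if defined(__aarch64__) && defined(__linux__)
    /* Ask the kernel which CPU features it reports to user space. */
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__aarch64__)
    /* Other AArch64 systems: rely on the architecture guarantee. */
    return 1;
#else
    /* Not built for AArch64, so the NEON path is compiled out. */
    return 0;
#endif
}
```

The `asimd` entry in a `/proc/cpuinfo` Features line corresponds to the same capability bit.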

zuiderkwast

LGTM.

I didn't review the actual SIMD code very carefully, but we have a test to compare the results with/without SIMD so I think it's safe.

I'll wait for @lipzhu's approval before merge.

xbasel · 2025-04-30T09:42:49Z

> LGTM.
>
> I didn't review the actual SIMD code very carefully, but we have a test to compare the results with/without SIMD so I think it's safe.
>
> I'll wait for @lipzhu's approval before merge.

The code is actually already being tested with SIMD vs Scalar in tests/unit/hyperloglog.tcl. See PFMERGE results with simd test.
All you need to do is to run the test on ARM.

➜  valkey git:(hll_neon) ✗ ./runtest --single unit/hyperloglog --only "PFMERGE results with simd"
Cleanup: may take some time... OK
..
[ok]: PFMERGE results with simd (457 ms)
..
[1/1 done]: unit/hyperloglog (1 seconds)

                   The End

Execution time of different units:
  1 seconds - unit/hyperloglog

\o/ All tests passed without errors!

Cleanup: may take some time... OK
➜  valkey git:(hll_neon) ✗ cat /proc/cpuinfo
processor	: 0
BogoMIPS	: 2100.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer	: 0x41
CPU architecture: 8

This is an ARMv8-A 64-bit CPU (and supports NEON).

zuiderkwast · 2025-04-30T11:01:06Z

> The code is actually already being tested with SIMD vs Scalar in tests/unit/hyperloglog.tcl. See PFMERGE results with simd test.
> All you need to do is to run the test on ARM.

Yeah, I know. That's why I said "we have a test to compare the results with/without SIMD so I think it's safe".

xbasel marked this pull request as draft March 18, 2025 11:50

xbasel mentioned this pull request Mar 18, 2025

[NEW] Implement ARM NEON and SVE2 optimizations for Hyperloglog #1860

Open

xbasel force-pushed the hll_neon branch 5 times, most recently from d5cc649 to b2c857e Compare March 18, 2025 16:41

xbasel self-assigned this Mar 18, 2025

xbasel force-pushed the hll_neon branch from b2c857e to 4c45315 Compare March 18, 2025 16:56

xbasel marked this pull request as ready for review March 18, 2025 16:56

zuiderkwast reviewed Mar 21, 2025

xbasel commented Mar 22, 2025

use macro

36eaac6

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>

xbasel force-pushed the hll_neon branch from d93927e to 36eaac6 Compare April 10, 2025 12:31

Merge remote-tracking branch 'origin/unstable' into hll_neon

e9a4028

xbasel requested a review from zuiderkwast April 10, 2025 12:58

lipzhu reviewed Apr 14, 2025

limit to aarch64 which is guaranteed to have neon

5e003e1

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>

xbasel requested a review from lipzhu April 29, 2025 14:17

Merge remote-tracking branch 'origin/unstable' into hll_neon

9f4aed7

xbasel marked this pull request as draft April 29, 2025 16:17

zuiderkwast reviewed Apr 29, 2025

Update src/hyperloglog.c

1fb9d04

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>

xbasel requested a review from zuiderkwast April 29, 2025 18:57

xbasel marked this pull request as ready for review April 29, 2025 18:58

zuiderkwast reviewed Apr 29, 2025

simplify

d915627

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>

Merge remote-tracking branch 'origin/unstable' into hll_neon

1fcf2b9

xbasel requested a review from zuiderkwast April 30, 2025 08:56

zuiderkwast approved these changes Apr 30, 2025

zuiderkwast added the release-notes This issue should get a line item in the release notes label Apr 30, 2025

lipzhu approved these changes May 1, 2025

zuiderkwast merged commit dd772c4 into valkey-io:unstable May 1, 2025
51 checks passed
