-
Notifications
You must be signed in to change notification settings - Fork 790
Hyperloglog ARM NEON SIMD optimization #1859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1859 +/- ##
============================================
- Coverage 70.96% 70.95% -0.01%
============================================
Files 123 123
Lines 66135 66133 -2
============================================
- Hits 46934 46926 -8
- Misses 19201 19207 +6
🚀 New features to boost your workflow:
|
d5cc649
to
b2c857e
Compare
- Implement two NEON optmized functions for converting between raw and dense representations in HyperLogLog: 1. hllMergeDenseNEON 2. hllDenseCompressNEON These functions process 16 registers in each iteration. - Utilize existing SIMD test in hyperloglog.tcl (previously added for AVX2 optimization) to validate NEON implementation Test: valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t} +-------------------+-----------+-----------+---------------+ | Metric | Before | After | Improvement % | +-------------------+-----------+-----------+---------------+ | Throughput (k rps)| 7.42 | 76.98 | 937.47% | +-------------------+-----------+-----------+---------------+ | Latency (msec) | | | | | avg | 6.686 | 0.595 | 91.10% | | min | 0.520 | 0.152 | 70.77% | | p50 | 7.799 | 0.599 | 92.32% | | p95 | 8.039 | 0.767 | 90.46% | | p99 | 8.111 | 0.807 | 90.05% | | max | 9.263 | 1.463 | 84.21% | +-------------------+-----------+-----------+---------------+ Hardware: CPU: Graviton 3 Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-63 Memory: 256 GB Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 times faster is a pretty good improvement. :)
I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?
It is similar. NEON vectors are 128 bit, AVX2 is 256 bit. The padding and lookup is a bit different in AVX2. |
src/hyperloglog.c
Outdated
#ifdef __ARM_NEON | ||
static int simd_enabled = 1; | ||
#define HLL_USE_NEON (simd_enabled) | ||
#else | ||
#define HLL_USE_NEON 0 | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need add a runtime check here when server startup ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've limited this to aarch64, which is guaranteed to have neon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is guaranteed to have NEON support in the compile environment. Is it possible to run AArch64 binaries on a non-neon platform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, all AArch64 include NEON. AArch64 was new in ARMv8. https://en.wikipedia.org/wiki/ARM_architecture_family#Armv8. (Even older ARM 32-bit have NEON but we don't care about those for SIMD.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm wondering if we compiled on the AArch64 platform and released the ARM version binary, but the binary is running on an older ARM platform that doesn't support the NEON architecture (e.g., ARMv6). If this edge case isn't a concern, I'm fine with skipping the runtime checker.
https://developer.arm.com/documentation/102474/0100/Fundamentals-of-Armv8-Neon-technology
AArch64 is the name used to describe the 64-bit Execution state of the Armv8-A architecture. In AArch64 state, the processor executes the A64 instruction set, which contains Neon instructions (also referred to as SIMD instructions). GNU and Linux documentation sometimes refers to AArch64 as ARM64.
Older architectures do not support AArch64 execution state, so the binary won't run.
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I didn't review the actual SIMD code very carefully, but we have a test to compare the results with/without SIMD so I think it's safe.
I'll wait for @lipzhu's approval before merge.
The code is actually already being tested with SIMD vs Scalar in
This is an ARMv8-A 64-bit CPU (and supports NEON). |
Yeah, I know. That's why I said "we have a test to compare the results with/without SIMD so I think it's safe". |
Add ARM NEON optimization for HyperLogLog
Implement two NEON optmized functions for converting between raw and
dense representations in HyperLogLog:
These functions process 16 registers in each iteration.
Utilize existing SIMD test in hyperloglog.tcl (previously added for
AVX2 optimization) to validate NEON implementation
Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}
Hardware:
Command stats:
Before:
After:
Improved by ~14.7x.
Functional testing command:
The SIMD test randomizes input and comapres scalar vs simd results.