-
Notifications
You must be signed in to change notification settings - Fork 37.4k
[IBD] multi-byte block obfuscation #31144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31144. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
b76314a
to
10b9e68
Compare
I think your example may be a bit skewed? It shows how much time is spent when deserializing a If you want the micro-benchmark to be representative, I'd presume you'd have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading. |
23f4809
to
6ae466b
Compare
Thanks, I've extended your previous benchmarks with both Autofile serialization and very small vectors. |
353915b
to
a3dc138
Compare
It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren't shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way? I did a quick check on my laptop and it seems the I can try to bench on another machine later, to see if it makes a difference. Can you clarify what type of machine you tested this on? |
That's what I'm measuring currently, but I don't expect more than 2% difference here.
Benchmarks indicated that the 64 bit compiled result was basically the same.
I'll investigate, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK a3dc138
Operating on CPU words rather than individual bytes. 👍
Not entirely clear to me from #31144 (comment) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.
The verbosity of std::memcpy
hurts readability but alignment issues are real.
95e38ac
to
cd61e8b
Compare
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
cd61e8b
to
d72968f
Compare
std::vector<std::byte>(8)
to uint64_t
d72968f
to
555f8c8
Compare
Restored the deserialization validation in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review bba6473.
Thanks for improving the dbwrapper test case and minimizing _hex_v
usage in the end state of the PR!
8a918bb
to
9406c17
Compare
Rebased, opened issue for unrelated test failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review 9406c17
Getting pretty close to ACK. One main inlined question about error handling in dbwrapper.cpp.
src/dbwrapper.cpp
Outdated
LogInfo("Using obfuscation key for %s: %s", fs::PathToString(params.path), HexStr(obfuscation_key_bytes)); | ||
m_obfuscation = MakeByteSpan(obfuscation_key_bytes).first<Obfuscation::SIZE_BYTES>(); | ||
} else { | ||
LogError("Invalid obfuscation key for %s: %s - continuing unobfuscated!", fs::PathToString(params.path), HexStr(obfuscation_key_bytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really continue after reading an invalid key length? Feels like it would be good to throw an exception, or assert
the condition instead of if
.
(Could adjust the message/comment to include the possibility of us being old software running on some future data format we don't recognize?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm against throwing in general - especially for constructors, but we're already reading files in there and already throwing via HandleError
- changed and moved to the related refactoring commit.
src/node/mempool_persist.cpp
Outdated
} else if (version == MEMPOOL_DUMP_VERSION) { | ||
std::vector<std::byte> obfuscation{8}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note in b085c52:
Seems like there's a newish footgun with curly-brace initialization of vector
s.
This compiles because the passed value unambiguously specifies the length (std::byte
isn't implicitly constructible from integer literals):
std::vector<std::byte> obfuscation{512};
The following fails, because it's interpreted as initializer-list construction, and the integer literal would be narrowed to fit into the first element of the vector
:
std::vector<uint8_t> obfuscation{512};
So your code is correct, but I'm actually starting to feel some anti-curly sentiment brewing for vector
initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨ ₘₐ𝓖ᵢ𝒸 ✨
2fdc2a7
to
989537f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 989537f
Optimizing by operating on whole-word units instead of byte-for-byte makes for more efficient use of common hardware. Extracting the optimization into the Obfuscation
abstraction which caches whole-word "rotations" (the XOR-key pre-rotated at all possible byte-offsets) makes sense.
Clear speedup in benchmarks. Even if the least improved of those measured (ReadBlockBench
) would be the most representative of real-world use, it is sped up by ~15-16%.
Benchmarks
- ObfuscationBench: 12x speedup on Clang, 17x speedup on GCC.
- ReadBlockBench: 15-16% speedup on both.
- ReadRawBlockBench: ~9x speedup on both.
Details
Used ./build/bin/bench_bitcoin --filter="ObfuscationBench|ReadBlockBench|ReadRawBlockBench" --min-time=10000
on NixOS.
GCC 14.2.1
Before (06353e2, 2nd commit in PR):
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
861,980.77 | 1,160.12 | 0.2% | 9,437,186.38 | 3,177,041.69 | 2.970 | 1,048,576.97 | 0.0% | 10.79 | ObfuscationBench |
ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
5,509,795.20 | 181.49 | 0.5% | 64,457,565.88 | 20,006,289.95 | 3.222 | 3,636,177.14 | 0.3% | 11.03 | ReadBlockBench |
936,745.22 | 1,067.53 | 0.3% | 9,010,771.90 | 3,232,301.21 | 2.788 | 1,002,214.88 | 0.0% | 10.99 | ReadRawBlockBench |
After (989537f, last commit in PR):
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
51,720.05 | 19,334.86 | 0.4% | 327,684.05 | 190,464.16 | 1.720 | 65,536.55 | 0.0% | 10.99 | ObfuscationBench |
ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,675,373.37 | 213.89 | 0.3% | 55,774,507.70 | 16,981,281.76 | 3.284 | 2,699,256.61 | 0.5% | 10.98 | ReadBlockBench |
105,697.36 | 9,460.97 | 0.2% | 324,342.05 | 175,063.45 | 1.853 | 64,840.05 | 0.0% | 10.99 | ReadRawBlockBench |
Clang Clang 20.1.3
Before (06353e2, 2nd commit in PR):
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
907,086.11 | 1,102.43 | 0.1% | 6,815,747.33 | 3,342,513.78 | 2.039 | 524,289.21 | 0.0% | 11.01 | ObfuscationBench |
ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
5,509,068.75 | 181.52 | 0.1% | 61,830,240.13 | 19,977,540.00 | 3.095 | 3,020,308.24 | 0.4% | 10.98 | ReadBlockBench |
924,286.56 | 1,081.92 | 0.3% | 6,511,639.90 | 3,190,509.11 | 2.041 | 502,479.88 | 0.0% | 10.97 | ReadRawBlockBench |
After (989537f, last commit in PR):
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
75,454.29 | 13,253.06 | 0.6% | 655,366.68 | 277,944.85 | 2.358 | 131,074.08 | 0.0% | 10.71 | ObfuscationBench |
ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,632,323.51 | 215.87 | 0.1% | 55,587,644.66 | 16,818,777.18 | 3.305 | 2,552,825.55 | 0.4% | 11.00 | ReadBlockBench |
105,903.19 | 9,442.59 | 0.1% | 262,149.05 | 176,010.62 | 1.489 | 33,762.05 | 0.1% | 11.00 | ReadRawBlockBench |
src/obfuscation.h
Outdated
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off | ||
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time | ||
Xor(target, rot_key, SIZE_BYTES); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
informational: Tried to optimize for aligned writes:
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off | |
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time | |
Xor(target, rot_key, SIZE_BYTES); | |
} | |
uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off | |
const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())}; | |
Xor(target, rot_key, alignment_remaining); | |
target = target.subspan(alignment_remaining); | |
rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES]; | |
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time | |
*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key; | |
} |
2x speedup on Clang for ObfuscationBench
. But no difference, or a few percent worse on GCC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing - while I was testing this out, I couldn't find a unit test that tested obfuscating a target buffer starting at different offsets in memory. Might be good to add one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xor_bytes_reference
randomly generates where the next offset will be, check it by adding:
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
if (!*this) return;
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
std::cout << "key_offset_bytes % SIZE_BYTES = " << (key_offset_bytes % SIZE_BYTES) << std::endl;
which reveals:
key_offset_bytes % SIZE_BYTES = 4
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 7
key_offset_bytes % SIZE_BYTES = 4
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 1
key_offset_bytes % SIZE_BYTES = 6
key_offset_bytes % SIZE_BYTES = 1
key_offset_bytes % SIZE_BYTES = 7
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 3
key_offset_bytes % SIZE_BYTES = 3
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 5
key_offset_bytes % SIZE_BYTES = 2
...
I'll investigate the alignment and whether we can use SIMD in GCC as well (similarly to how Clang auto-detects it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the alignment part, seems dangerous - do you see an explanation in the assembly why it would be faster?
I tried reproducing it, but I'm getting a significant slowdown instead (ran each 3x):
Current loop:
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
if (!*this) return;
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
Xor(target, rot_key, SIZE_BYTES);
}
Xor(target, rot_key, target.size());
}
ns/MiB | MiB/s | err% | total | benchmark |
---|---|---|---|---|
14,543.90 | 68,757.33 | 0.1% | 11.01 | ObfuscationBench |
14,604.52 | 68,471.95 | 0.0% | 11.01 | ObfuscationBench |
14,573.71 | 68,616.72 | 0.2% | 11.04 | ObfuscationBench |
Suggestion:
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
if (!*this) return;
uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())};
Xor(target, rot_key, alignment_remaining);
target = target.subspan(alignment_remaining);
rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES];
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
}
Xor(target, rot_key, target.size());
}
ns/MiB | MiB/s | err% | total | benchmark |
---|---|---|---|---|
19,552.98 | 51,143.11 | 0.1% | 11.01 | ObfuscationBench |
19,529.72 | 51,204.01 | 0.2% | 11.00 | ObfuscationBench |
20,046.53 | 49,883.94 | 0.1% | 11.03 | ObfuscationBench |
However, looking at the assembly closer, I reiterated why gcc doesn't do any SIMD and experimented with a few options and it seems we can help GCC by doing a big unrolled iteration before the whole thing:
for (constexpr auto unroll{8}; target.size() >= SIZE_BYTES * unroll; target = target.subspan(SIZE_BYTES * unroll)) {
for (size_t i{0}; i < unroll; ++i) {
Xor(target.subspan(i * SIZE_BYTES, SIZE_BYTES), rot_key, SIZE_BYTES);
}
}
which helps GCC recognise that the loop is free of dependencies and performs a 128-bit SIMD pxor over 64-byte blocks instead of doing two 64-bit scalar xors per 16-byte stride - speeding up the benchmark from the mentioned ~19 GiB/s
to ~27 GiB/s
- i.e. additional 40% speedup. (Clang is unaffected)
I'll do an IBD again to see if it's significant enough to include it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you see an explanation in the assembly why it would be faster?
Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a co
F438
mbination of objdump
with a bunch of flags + searching in less
for the consteval hex stuff ~9 months ago but suspect Obfuscation
-methods get heavily inlined?
Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:
At tip of PR 989537f:
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
73,504.93 | 13,604.53 | 0.1% | 655,366.68 | 270,831.04 | 2.420 | 131,074.07 | 0.0% | 10.66 | ObfuscationBench |
With diff from #31144 (comment) applied:
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
37,501.52 | 26,665.59 | 0.9% | 294,930.14 | 138,093.33 | 2.136 | 32,771.94 | 0.0% | 11.03 | ObfuscationBench |
xor_bytes_reference
randomly generates where the next offset will be, check it by adding:
That only tests with an offset into the key, not an offset into the target, right?
I'm thinking something like:
diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 1985b24663..2e10792710 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
for (size_t test{0}; test < 100; ++test) {
const size_t write_size{1 + m_rng.randrange(100U)};
const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around
+ const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))};
std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{};
if (m_rng.randbool()) m_rng.fillrand(key_bytes);
@@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
std::vector expected{m_rng.randbytes<std::byte>(write_size)};
std::vector actual{expected};
- expected_xor(expected, key_bytes, key_offset);
- obfuscation(actual, key_offset);
+ expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset);
+ obfuscation(std::span{actual}.subspan(write_offset), key_offset);
BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end());
}
It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't able to use the small Godbolt sample to divine the reason for my 2x Clang speedup yet.
Fell back to objdump -gSC build/bin/bench_bitcoin
and clearly see:
*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
1107c0: f3 0f 6f 4c d1 f0 movdqu -0x10(%rcx,%rdx,8),%xmm1
1107c6: f3 0f 6f 14 d1 movdqu (%rcx,%rdx,8),%xmm2
1107cb: 66 0f ef c8 pxor %xmm0,%xmm1
1107cf: 66 0f ef d0 pxor %xmm0,%xmm2
1107d3: f3 0f 7f 4c d1 f0 movdqu %xmm1,-0x10(%rcx,%rdx,8)
1107d9: f3 0f 7f 14 d1 movdqu %xmm2,(%rcx,%rdx,8)
1107de: 48 83 c2 04 add $0x4,%rdx
1107e2: 48 39 d0 cmp %rdx,%rax
1107e5: 75 d9 jne 1107c0 <Obfuscation::operator()(std::span<std::byte, 18446744073709551615ul>, unsigned long) const+0xf0>
from my suggestion, close to the ObfuscationBench
symbols. But I couldn't spot pxor
/xmm
close-by in the baseline dump.
Maybe too synthetic to bother with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me longer than anticipated, but I have measured different scenarios here to understand where alignment matters and where it doesn't, with gcc, clang and apple clang
GCC 14 is 38.5% faster with alignment & 64 byte unroll compared to just a 64 bit xor
------ C++ compiler .......................... GNU 14.2.0 ------
rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1\
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& git log -1 --pretty=format:'%h %s'
00d77f4 refactor: unify xor vs obfuscation nomenclature
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
879,957.05 | 1,136.42 | 0.0% | 9,437,186.42 | 3,158,477.41 | 2.988 | 1,048,576.99 | 0.0% | 5.50 | ObfuscationBench |
879,916.05 | 1,136.47 | 0.0% | 9,437,186.42 | 3,157,995.25 | 2.988 | 1,048,576.99 | 0.0% | 5.50 | ObfuscationBench |
2daaa83 optimization: migrate fixed-size obfuscation from
std::vector<std::byte>
touint64_t
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
51,665.69 | 19,355.21 | 0.0% | 327,684.05 | 185,409.22 | 1.767 | 65,536.55 | 0.0% | 5.50 | ObfuscationBench |
52,520.96 | 19,040.02 | 0.2% | 327,684.05 | 188,497.60 | 1.738 | 65,536.55 | 0.0% | 5.51 | ObfuscationBench |
c69f744 experiment: aligned xor only
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
51,601.83 | 19,379.16 | 0.0% | 327,687.85 | 185,151.71 | 1.770 | 65,536.95 | 0.0% | 5.50 | ObfuscationBench |
51,952.63 | 19,248.30 | 0.1% | 327,687.85 | 186,403.90 | 1.758 | 65,536.95 | 0.0% | 5.50 | ObfuscationBench |
9472af6 experiment: unroll only
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
46,341.13 | 21,579.10 | 0.0% | 475,139.45 | 166,272.12 | 2.858 | 81,920.35 | 0.0% | 5.50 | ObfuscationBench |
45,936.99 | 21,768.95 | 0.0% | 475,139.45 | 164,853.63 | 2.882 | 81,920.35 | 0.0% | 5.50 | ObfuscationBench |
e9fd88b experiment: unroll + alignment
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
37,281.01 | 26,823.31 | 0.1% | 475,139.64 | 133,808.24 | 3.551 | 81,920.54 | 0.0% | 5.35 | ObfuscationBench |
37,652.77 | 26,558.47 | 0.1% | 475,139.64 | 135,132.04 | 3.516 | 81,920.54 | 0.0% | 5.35 | ObfuscationBench |
LLVM Clang20 is 2.15x faster with alignment & 64 byte unroll compared to just a 64 bit xor
------ C++ compiler .......................... Clang 20.1.6 ------
rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& git log -1 --pretty=format:'%h %s'
00d77f4 refactor: unify xor vs obfuscation nomenclature
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
926,379.31 | 1,079.47 | 0.1% | 6,815,747.36 | 3,325,871.40 | 2.049 | 524,289.23 | 0.0% | 5.50 | ObfuscationBench |
926,608.63 | 1,079.20 | 0.0% | 6,815,747.36 | 3,326,572.88 | 2.049 | 524,289.23 | 0.0% | 5.50 | ObfuscationBench |
2daaa83 optimization: migrate fixed-size obfuscation from
std::vector<std::byte>
touint64_t
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
74,817.84 | 13,365.80 | 0.0% | 655,366.68 | 268,566.88 | 2.440 | 131,074.08 | 0.0% | 5.50 | ObfuscationBench |
75,351.98 | 13,271.05 | 0.1% | 655,366.68 | 270,405.84 | 2.424 | 131,074.08 | 0.0% | 5.51 | ObfuscationBench |
c69f744 experiment: aligned xor only
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
38,396.57 | 26,043.99 | 0.1% | 294,929.64 | 137,732.80 | 2.141 | 32,771.94 | 0.0% | 5.35 | ObfuscationBench |
38,151.47 | 26,211.31 | 0.2% | 294,929.64 | 136,877.12 | 2.155 | 32,771.94 | 0.0% | 5.38 | ObfuscationBench |
9472af6 experiment: unroll only
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
35,556.53 | 28,124.23 | 0.1% | 262,148.44 | 127,610.00 | 2.054 | 16,384.64 | 0.0% | 5.32 | ObfuscationBench |
36,409.31 | 27,465.50 | 0.1% | 262,148.44 | 130,650.29 | 2.006 | 16,384.64 | 0.0% | 5.33 | ObfuscationBench |
e9fd88b experiment: unroll + alignment
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
34,770.15 | 28,760.30 | 0.0% | 262,149.44 | 124,756.60 | 2.101 | 16,384.83 | 0.0% | 5.32 | ObfuscationBench |
36,214.90 | 27,612.94 | 0.0% | 262,149.44 | 129,932.72 | 2.018 | 16,384.84 | 0.0% | 5.33 | ObfuscationBench |
Combining unrolling with aligning readjusts the speed for Apple Clang (which was already doing these automatically so there's no speed difference because of the latest commit) and brings Clang and GCC closer to the max bus speed.
Clang needed your alignment to achieve that, and GCC needed the unrolled 64-byte XOR to take advantage of SIMD.
Added this as a separate commit with you as coauthor, explaining the extra optimizations + updated godbolt and table from
Target & Compiler | Stride (per iter) | Main operation(s) in loop | Effective XORs |
---|---|---|---|
Clang x86-64 (trunk) | 32 bytes | two unaligned 128-bit loads → pxor with broadcast key → stores | 4 × 64-bit |
GCC x86-64 (trunk) | 16 bytes | two scalar 64-bit xor reg-mem instructions directly on the target | 2 × 64-bit |
GCC RV32 (trunk) | 8 bytes | copy 8 B to stack scratch → XOR via two 32-bit regs → copy back | 1 × 64-bit |
GCC s390x (big-endian 14.2) | 64 bytes | eight XC ops using key cached on stack | 8 × 64-bit |
to:
Target & Compiler | Stride (per hot-loop iter) | Main operation(s) in loop | Effective XORs / iter |
---|---|---|---|
Clang x86-64 (trunk) | 64 bytes | 4 × movdqu → pxor → store (aligned after peel) | 8 × 64-bit |
GCC x86-64 (trunk) | 64 bytes | same 4 × movdqu/pxor sequence, enabled by 8-way unroll | 8 × 64-bit |
GCC RV32 (trunk) | 8 bytes | copy 8 B to temp stack → 2 × 32-bit XOR → copy back | 1 × 64-bit (as 2 × 32-bit) |
GCC s390x (big-endian 14.2) | 64 bytes | 8 × XC (mem-mem 8-B XOR) with key cached on stack | 8 × 64-bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a build failure on 32 bit systems because of the alignment changes - fixed it by moving the std::assume_aligned
check inside the alignment condition itself.
Remeasured the benchmarks (all 3 compilers are the same as the previous push) and redid the endianness check (as described in the PR) and updated the https://godbolt.org/z/35nveanf5 again.
src/obfuscation.h
Outdated
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off | ||
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time | ||
Xor(target, rot_key, SIZE_BYTES); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you see an explanation in the assembly why it would be faster?
Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a combination of objdump
with a bunch of flags + searching in less
for the consteval hex stuff ~9 months ago but suspect Obfuscation
-methods get heavily inlined?
Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:
At tip of PR 989537f:
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
73,504.93 | 13,604.53 | 0.1% | 655,366.68 | 270,831.04 | 2.420 | 131,074.07 | 0.0% | 10.66 | ObfuscationBench |
With diff from #31144 (comment) applied:
ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
37,501.52 | 26,665.59 | 0.9% | 294,930.14 | 138,093.33 | 2.136 | 32,771.94 | 0.0% | 11.03 | ObfuscationBench |
xor_bytes_reference
randomly generates where the next offset will be, check it by adding:
That only tests with an offset into the key, not an offset into the target, right?
I'm thinking something like:
diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 1985b24663..2e10792710 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
for (size_t test{0}; test < 100; ++test) {
const size_t write_size{1 + m_rng.randrange(100U)};
const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around
+ const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))};
std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{};
if (m_rng.randbool()) m_rng.fillrand(key_bytes);
@@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
std::vector expected{m_rng.randbytes<std::byte>(write_size)};
std::vector actual{expected};
- expected_xor(expected, key_bytes, key_offset);
- obfuscation(actual, key_offset);
+ expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset);
+ obfuscation(std::span{actual}.subspan(write_offset), key_offset);
BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end());
}
It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.
Since production code only uses keys of length 8, we're not testing with other values anymore Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
Since a previous PR already solved the tiny byte-array xors during serialization, we're only concentrating on big continuous chunks now. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 879,957.05 | 1,136.42 | 0.0% | 9,437,186.42 | 3,158,477.41 | 2.988 | 1,048,576.99 | 0.0% | 5.50 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 926,379.31 | 1,079.47 | 0.1% | 6,815,747.36 | 3,325,871.40 | 2.049 | 524,289.23 | 0.0% | 5.50 | `ObfuscationBench`
This simplifies the next commits.
Since `CDBWrapper::Read` still supports vector only, we can initialize `m_obfuscation` directly instead of using a separate helper. `CreateObfuscation` was also inlined, replaced `key_exists` with `key_missing`, and simplified the `if` condition that writes a new obfuscation key. Since we're already throwing via `HandleError` here, the obfuscation key is also validated by throwing if invalid.
989537f
to
5627cd4
Compare
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
…yte>` to `uint64_t` Since `util::Xor` now operates on `uint64_t` keys, migrated the obfuscation key end‑to‑end - from use sites in util::Xor up through streams, LevelDB wrappers, and disk (de)serialization - replacing all former `std::vector<std::byte>` keys with `uint64_t` (we still serialize them as vectors but convert immediately to `uint64_t` on load). This is why tests still generate vector keys and convert them to `uint64_t` later instead of generating them directly. We also short‑circuit `Xor` calls when the key is zero to avoid unnecessary calculations (e.g., `MakeWritableByteSpan`). In `Obfuscation::Unserialize` we can safely throw an `std::logic_error` since during mempool fuzzing `mempool_persist.cpp#L141` catches and ignored these errors safely. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 51,665.69 | 19,355.21 | 0.0% | 327,684.05 | 185,409.22 | 1.767 | 65,536.55 | 0.0% | 5.50 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 74,817.84 | 13,365.80 | 0.0% | 655,366.68 | 268,566.88 | 2.440 | 131,074.08 | 0.0% | 5.50 | `ObfuscationBench` Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
Benchmarks indicated that processing 8-byte words instead of bytes already gives an order of magnitude speed-up, but: * GCC still emitted scalar code; * Clang’s auto-vectorised loop ran on the slow unaligned-load path. Fix contains: * peeling the mis-aligned head enabled the hot loop starting at an 8-byte address; * `std::assume_aligned<8>` tells the optimiser the promise holds - required to keep Apple Clang happy; * manually unrolling the body to 64 bytes enabled GCC to auto-vectorise. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 37,281.01 | 26,823.31 | 0.1% | 475,139.64 | 133,808.24 | 3.551 | 81,920.54 | 0.0% | 5.35 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 34,770.15 | 28,760.30 | 0.0% | 262,149.44 | 124,756.60 | 2.101 | 16,384.83 | 0.0% | 5.32 | `ObfuscationBench` Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
5627cd4
to
63854e8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 63854e8
While the code is made less straightforward due to optimizations, I think it's still worth doing it to minimize overhead from obfuscation.
Changes since first ACK (#31144 (review)):
- Changed commit messages to include benchmark results from more recent GCC and newer vanilla instead of Apple-flavoured Clang.
- Modified unit test to add verification of obfuscation starting at non-aligned offsets as suggested.
- Renamed benchmark file as suggested.
- Removed
ReadBlockBench
andReadRawBlockBench
results from commit message (verified to still improve at the end of this review). - Added new commit at the end with further optimization partially inspired by my suggestion ([IBD] multi-byte block obfuscation #31144 (comment))!
- Adapts code to help compiler with loop unrolling and SIMD.
- Instead of having
*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key
, the unrolling allows for usingObfuscation::Xor()
withmemcpy
s back & forth and still have compilers optimize effectively. - I like how this version calculates the number of initial non-aligned bytes.
std::assume_aligned
- nice find!
Testing
Guix + Windows
Was curious about Windows performance, so I patched the Guix build to include bench_bitcoin.exe.
Patch
diff --git a/contrib/guix/libexec/build.sh b/contrib/guix/libexec/build.sh
index ae98eba744..25191f7e24 100755
--- a/contrib/guix/libexec/build.sh
+++ b/contrib/guix/libexec/build.sh
@@ -206,7 +206,7 @@ mkdir -p "$OUTDIR"
###########################
# CONFIGFLAGS
-CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=OFF -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
+CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=ON -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
# CFLAGS
HOST_CFLAGS="-O2 -g"
diff --git a/src/bench/CMakeLists.txt b/src/bench/CMakeLists.txt
index 3003115068..56f78ab21d 100644
--- a/src/bench/CMakeLists.txt
+++ b/src/bench/CMakeLists.txt
@@ -58,6 +58,8 @@ target_raw_data_sources(bench_bitcoin NAMESPACE benchmark::data
data/block413567.raw
)
+add_windows_application_manifest(bench_bitcoin)
+
target_link_libraries(bench_bitcoin
core_interface
test_util
Result on Windows Guix build show we are in the same ballpark as results in commit message (63854e8):
> bench_bitcoin.exe -filter=ObfuscationBench -min-time=10000
| ns/MiB | MiB/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 37,787.80 | 26,463.57 | 0.7% | 11.22 | `ObfuscationBench`
Result on Linux Guix build is similar as Windows, also those in commit messages and measurements with my suggested optimization:
₿ ./bench_bitcoin -filter=ObfuscationBench -min-time=10000
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 36,053.83 | 27,736.30 | 0.4% | 475,140.24 | 132,841.83 | 3.577 | 81,920.74 | 0.0% | 10.96 | `ObfuscationBench`
Prior to running doas perf system tune
, it actually ran significantly faster (probably due to some unreliable CPU boosting logic):
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 29,425.58 | 33,984.04 | 0.8% | 475,140.23 | 107,611.98 | 4.415 | 81,920.73 | 0.0% | 11.04 | `ObfuscationBench`
ReadBlockBench + ReadRawBlockBench
Verified that related benchmarks still improved (non-Guix, NixOS, compiler Clang 20.1.3):
Before (2df824f, PR base):
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 5,440,481.19 | 183.81 | 0.1% | 61,834,466.65 | 19,827,449.24 | 3.119 | 3,028,151.40 | 0.4% | 10.98 | `ReadBlockBench`
| 919,632.97 | 1,087.39 | 0.0% | 6,511,624.89 | 3,187,581.46 | 2.043 | 502,474.87 | 0.0% | 11.00 | `ReadRawBlockBench`
After (63854e8, PR tip):
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 4,649,147.57 | 215.09 | 0.1% | 55,576,004.70 | 16,919,942.23 | 3.285 | 2,542,138.60 | 0.5% | 11.03 | `ReadBlockBench`
| 87,733.01 | 11,398.22 | 0.2% | 262,263.03 | 116,898.39 | 2.244 | 18,154.03 | 0.1% | 10.80 | `ReadRawBlockBench`
Concept ACK Will benchmark this soon! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code-Review & Tested ACK 63854e8
The code lgtm, thanks to @hodlinator reviews.
Regarding Tests
- x86_64 ubuntu 24.04.2
C++ compiler ................................................................... Clang 18.1.3
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 107,903.53 | 9,267.54 | 0.1% | 475,140.31 | 452,373.92 | 1.050 | 81,920.81 | 0.0% | 10.99 | `ObfuscationBench`
C++ compiler ................................................................... GNU 13.3.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 107,881.57 | 9,269.42 | 0.1% | 475,140.31 | 452,243.73 | 1.051 | 81,920.81 | 0.0% | 11.00 | `ObfuscationBench`
- MacOS Sequoia 15.3.2
C++ compiler ................................................................... Clang 16.0.0
| ns/MiB | MiB/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 22,649.25 | 44,151.57 | 4.7% | 10.80 | `ObfuscationBench`
C++ compiler ................................................................... Clang 18.1.7
| ns/MiB | MiB/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 21,154.05 | 47,272.27 | 0.1% | 10.99 | `ObfuscationBench`
commit 3959a90
C++ compiler ................................................................... Clang 18.1.3
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 117,618.83 | 8,502.04 | 0.1% | 655,362.22 | 494,070.94 | 1.326 | 131,072.42 | 0.0% | 5.51 | `ObfuscationBench`
C++ compiler ................................................................... GNU 13.3.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 113,280.61 | 8,827.64 | 0.1% | 327,683.32 | 475,761.88 | 0.689 | 65,536.51 | 0.0% | 5.50 | `ObfuscationBench`
Commit be3435c
C++ compiler ................................................................... Clang 18.1.3
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 788,559.17 | 1,268.14 | 0.1% | 6,815,747.22 | 3,366,204.35 | 2.025 | 524,289.09 | 0.0% | 5.50 | `ObfuscationBench`
C++ compiler ................................................................... GNU 13.3.0
| ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 604,312.84 | 1,654.77 | 0.1% | 9,437,186.04 | 2,579,119.63 | 3.659 | 1,048,576.71 | 0.0% | 5.51 | `ObfuscationBench`
I will be doing IBD test on my machine and share the results soon!
I don't think this list is complete. At least for me it is truncated, due to the benchmark rename in the early commit. I tried to fix it in corecheck/corecheck#90, but I am not sure how to re-run it here. Maybe rebase? |
This change is part of [IBD] - Tracking PR for speeding up Initial Block Download
Summary
Current block obfuscations are done byte-by-byte, this PR batches them to 64 bit primitives to speed up obfuscating bigger memory batches.
This is especially relevant now that #31551 was merged, having bigger obfuscatable chunks.
Since this obfuscation is optional, the speedup measured here depends on whether it's a random value or completely turned off (i.e. XOR-ing with 0).
Changes in testing, benchmarking and implementation
std::vector<std::byte>(8)
keys to plainuint64_t
;Assembly
Memory alignment is enforced by a small peel-loop (
std::memcpy
is optimized out on tested platform), with anstd::assume_aligned<8>
check, see the Godbolt listing at https://godbolt.org/z/35nveanf5 for detailsDetails
Endianness
The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
Since CI lacks a big-endian machine, testing was done locally via Docker.
Details
Measurements (micro benchmarks and full IBDs)
Clang 20.1.6
ObfuscationBench
ObfuscationBench
and
ObfuscationBench
GNU 14.2.0
ObfuscationBench
ObfuscationBench
and
ObfuscationBench
i.e. 26.6x faster obfuscation with GCC, 23.6x faster with Clang
For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144
Running an IBD until 888888 blocks reveals a 4% speedup.
Details
SSD: