8000 [IBD] multi-byte block obfuscation by l0rinc · Pull Request #31144 · bitcoin/bitcoin · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[IBD] multi-byte block obfuscation #31144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

l0rinc
Copy link
Contributor
@l0rinc l0rinc commented Oct 24, 2024

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

Summary

Current block obfuscations are done byte-by-byte, this PR batches them to 64 bit primitives to speed up obfuscating bigger memory batches.
This is especially relevant now that #31551 was merged, having bigger obfuscatable chunks.

Since this obfuscation is optional, the speedup measured here depends on whether it's a random value or completely turned off (i.e. XOR-ing with 0).

Changes in testing, benchmarking and implementation

  • Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
  • Migrated std::vector<std::byte>(8) keys to plain uint64_t;
  • Process unaligned bytes separately and unroll body to 64 bytes.

Assembly

Memory alignment is enforced by a small peel-loop (std::memcpy is optimized out on tested platform), with an std::assume_aligned<8> check, see the Godbolt listing at https://godbolt.org/z/35nveanf5 for details

Details
Target & Compiler Stride (per hot-loop iter) Main operation(s) in loop Effective XORs / iter
Clang x86-64 (trunk) 64 bytes 4 × movdqu → pxor → store 8 × 64-bit
GCC x86-64 (trunk) 64 bytes 4 × movdqu/pxor sequence, enabled by 8-way unroll 8 × 64-bit
GCC RV32 (trunk) 8 bytes copy 8 B to temp → 2 × 32-bit XOR → copy back 1 × 64-bit (as 2 × 32-bit)
GCC s390x (big-endian 14.2) 64 bytes 8 × XC (mem-mem 8-B XOR) with key cached on stack 8 × 64-bit

Endianness

The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
Since CI lacks a big-endian machine, testing was done locally via Docker.

Details
brew install podman pigz
softwareupdate --install-rosetta
podman machine init
podman machine start
docker run --platform linux/s390x -it ubuntu:latest /bin/bash
  apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
  cd /mnt && git clone --depth=1 https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
  cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
  ./build/bin/test_bitcoin --run_test=streams_tests

Measurements (micro benchmarks and full IBDs)

cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc/clang -DCMAKE_CXX_COMPILER=g++/clang++ &&
cmake --build build -j$(nproc) &&
build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000

Clang 20.1.6

Before:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
926,379.31 1,079.47 0.1% 6,815,747.36 3,325,871.40 2.049 524,289.23 0.0% 5.50 ObfuscationBench

After:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
74,817.84 13,365.80 0.0% 655,366.68 268,566.88 2.440 131,074.08 0.0% 5.50 ObfuscationBench

and

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
34,770.15 28,760.30 0.0% 262,149.44 124,756.60 2.101 16,384.83 0.0% 5.32 ObfuscationBench
GNU 14.2.0

Before:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
879,957.05 1,136.42 0.0% 9,437,186.42 3,158,477.41 2.988 1,048,576.99 0.0% 5.50 ObfuscationBench

After:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
51,665.69 19,355.21 0.0% 327,684.05 185,409.22 1.767 65,536.55 0.0% 5.50 ObfuscationBench

and

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
37,281.01 26,823.31 0.1% 475,139.64 133,808.24 3.551 81,920.54 0.0% 5.35 ObfuscationBench

i.e. 26.6x faster obfuscation with GCC, 23.6x faster with Clang

For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144


Running an IBD until 888888 blocks reveals a 4% speedup.

Details

SSD:

COMMITS="8324a00bd4a6a5291c841f2d01162d8a014ddb02 5ddfd31b4158a89b0007cfb2be970c03d9278525"; \
STOP_HEIGHT=888888; DBCACHE=1000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && \
    cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

8324a00 test: Compare util::Xor with randomized inputs against simple impl
5ddfd31 optimization: Xor 64 bits together instead of byte-by-byte

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00bd4a6a5291c841f2d01162d8a014ddb02)
  Time (abs ≡):        25033.413 s               [User: 33953.984 s, System: 2613.604 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31b4158a89b0007cfb2be970c03d9278525)
  Time (abs ≡):        24110.710 s               [User: 33389.536 s, System: 2660.292 s]

Relative speed comparison
        1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00bd4a6a5291c841f2d01162d8a014ddb02)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31b4158a89b0007cfb2be970c03d9278525)

HDD:

COMMITS="71eb6eaa740ad0b28737e90e59b89a8e951d90d9 46854038e7984b599d25640de26d4680e62caba7"; \
STOP_HEIGHT=888888; DBCACHE=4500; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
hyperfine \
  --sort 'command' \
  --runs 2 \
  --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

71eb6ea test: compare util::Xor with randomized inputs against simple impl
4685403 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
  Time (mean ± σ):     37676.293 s ± 83.100 s    [User: 36900.535 s, System: 2220.382 s]
  Range (minmax):   37617.533 s37735.053 s    2 runs

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 46854038e7984b599d25640de26d4680e62caba7)
  Time (mean ± σ):     36181.287 s ± 195.248 s    [User: 34962.822 s, System: 1988.614 s]
  Range (minmax):   36043.226 s36319.349 s    2 runs

Relative speed comparison
        1.04 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 46854038e7984b599d25640de26d4680e62caba7)

@DrahtBot
Copy link
Contributor
DrahtBot commented Oct 24, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31144.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK hodlinator, yuvicc
Concept ACK ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
  • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
  • #29307 (util: explicitly close all AutoFiles that have been written by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@maflcko
Copy link
Member
maflcko commented Oct 24, 2024

I think your example may be a bit skewed? It shows how much time is spent when deserializing a CScript from a block file. However, block files contain full blocks, where many (most?) of the writes are single bytes (or 4 bytes), see #30833 (comment). Thus, it would be useful to know what the overall end-to-end performance difference is. Also taking into account the utxo db.

If you want the micro-benchmark to be representative, I'd presume you'd have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading.

@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 23f4809 to 6ae466b Compare October 24, 2024 09:59
@l0rinc
Copy link
Contributor Author
l0rinc commented Oct 24, 2024

where many (most?) of the writes are single bytes (or 4 bytes)

Thanks, I've extended your previous benchmarks with both Autofile serialization and very small vectors.
I will also run a reindex of 400k blocks before and after to see if the effect is measurable or not.

@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 353915b to a3dc138 Compare October 24, 2024 12:20
@l0rinc l0rinc changed the title optimization: pack util::Xor into 64 bit chunks instead of doing it byte-by-byte optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte Oct 24, 2024
@l0rinc l0rinc marked this pull request as ready for review October 24, 2024 14:22
@maflcko
Copy link
Member
maflcko commented Oct 24, 2024

It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren't shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way?

I did a quick check on my laptop and it seems the XorSmall (1+4 bytes) is slower with this pull. The Xor (modified to check 40 bytes) was twice as fast. Overall, I'd expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

I can try to bench on another machine later, to see if it makes a difference.

Can you clarify what type of machine you tested this on?

@l0rinc
Copy link
Contributor Author
l0rinc commented Oct 24, 2024

Are they saying that IBD was 4% faster?

That's what I'm measuring currently, but I don't expect more than 2% difference here.

Also, I think they were created with the UB version of this pull, so may be outdated either way?

Benchmarks indicated that the 64 bit compiled result was basically the same.

Overall, I'd expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

I'll investigate, thanks.

Details

Posting the perf here for reference:
image
Reindexing until 300k blocks reveals that XOR usage was reduced:
image

Copy link
Contributor
@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK a3dc138

Operating on CPU words rather than individual bytes. 👍

Not entirely clear to me from #31144 (comment) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.

The verbosity of std::memcpy hurts readability but alignment issues are real.

@l0rinc l0rinc marked this pull request as draft October 25, 2024 18:47
@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch 3 times, most recently from 95e38ac to cd61e8b Compare October 26, 2024 21:58
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/32109998785

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch from cd61e8b to d72968f Compare October 26, 2024 22:17
@l0rinc l0rinc changed the title optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte optimization: change XOR obfuscation key from std::vector<std::byte>(8) to uint64_t Oct 26, 2024
@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch from d72968f to 555f8c8 Compare October 27, 2024 15:17
@l0rinc
Copy link
Contributor Author
l0rinc commented Apr 26, 2025

Restored the deserialization validation in Obfuscation::Unserialize, we can't assert, but we can throw a std::logic_error, since during mempool fuzzing https://github.com/bitcoin/bitcoin/blob/master/src/node/mempool_persist.cpp#L141 catches and ignored these errors safely (managed to reproduce it on Linux, not sure why it's not reproducible on Mac).
I've also split out all renames to a single commit before any other refactor or optimization to simplify the higher-risk changes.
PR is ready for review again!

@l0rinc l0rinc requested review from hodlinator and sipa May 1, 2025 11:47
Copy link
Contributor
@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review bba6473.

Thanks for improving the dbwrapper test case and minimizing _hex_v usage in the end state of the PR!

@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 8a918bb to 9406c17 Compare May 9, 2025 15:06
@l0rinc
Copy link
Contributor Author
l0rinc commented May 9, 2025

Rebased, opened issue for unrelated test failure

F438

Copy link
Contributor
@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review 9406c17

Getting pretty close to ACK. One main inlined question about error handling in dbwrapper.cpp.

LogInfo("Using obfuscation key for %s: %s", fs::PathToString(params.path), HexStr(obfuscation_key_bytes));
m_obfuscation = MakeByteSpan(obfuscation_key_bytes).first<Obfuscation::SIZE_BYTES>();
} else {
LogError("Invalid obfuscation key for %s: %s - continuing unobfuscated!", fs::PathToString(params.path), HexStr(obfuscation_key_bytes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really continue after reading an invalid key length? Feels like it would be good to throw an exception, or assert the condition instead of if.

(Could adjust the message/comment to include the possibility of us being old software running on some future data format we don't recognize?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against throwing in general - especially for constructors, but we're already reading files in there and already throwing via HandleError - changed and moved to the related refactoring commit.

} else if (version == MEMPOOL_DUMP_VERSION) {
std::vector<std::byte> obfuscation{8};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note in b085c52:
Seems like there's a newish footgun with curly-brace initialization of vectors.

This compiles because the passed value unambiguously specifies the length (std::byte isn't implicitly constructible from integer literals):

std::vector<std::byte> obfuscation{512};

The following fails, because it's interpreted as initializer-list construction, and the integer literal would be narrowed to fit into the first element of the vector:

std::vector<uint8_t> obfuscation{512};

So your code is correct, but I'm actually starting to feel some anti-curly sentiment brewing for vector initialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ ₘₐ𝓖ᵢ𝒸 ✨

@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 2fdc2a7 to 989537f Compare May 19, 2025 15:00
Copy link
Contributor
@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 989537f

Optimizing by operating on whole-word units instead of byte-for-byte makes for more efficient use of common hardware. Extracting the optimization into the Obfuscation abstraction which caches whole-word "rotations" (the XOR-key pre-rotated at all possible byte-offsets) makes sense.

Clear speedup in benchmarks. Even if the least improved of those measured (ReadBlockBench) would be the most representative of real-world use, it is sped up by ~15-16%.

Benchmarks

  • ObfuscationBench: 12x speedup on Clang, 17x speedup on GCC.
  • ReadBlockBench: 15-16% speedup on both.
  • ReadRawBlockBench: ~9x speedup on both.

Details

Used ./build/bin/bench_bitcoin --filter="ObfuscationBench|ReadBlockBench|ReadRawBlockBench" --min-time=10000 on NixOS.

GCC 14.2.1

Before (06353e2, 2nd commit in PR):

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
861,980.77 1,160.12 0.2% 9,437,186.38 3,177,041.69 2.970 1,048,576.97 0.0% 10.79 ObfuscationBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
5,509,795.20 181.49 0.5% 64,457,565.88 20,006,289.95 3.222 3,636,177.14 0.3% 11.03 ReadBlockBench
936,745.22 1,067.53 0.3% 9,010,771.90 3,232,301.21 2.788 1,002,214.88 0.0% 10.99 ReadRawBlockBench

After (989537f, last commit in PR):

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
51,720.05 19,334.86 0.4% 327,684.05 190,464.16 1.720 65,536.55 0.0% 10.99 ObfuscationBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
4,675,373.37 213.89 0.3% 55,774,507.70 16,981,281.76 3.284 2,699,256.61 0.5% 10.98 ReadBlockBench
105,697.36 9,460.97 0.2% 324,342.05 175,063.45 1.853 64,840.05 0.0% 10.99 ReadRawBlockBench

Clang Clang 20.1.3

Before (06353e2, 2nd commit in PR):

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
907,086.11 1,102.43 0.1% 6,815,747.33 3,342,513.78 2.039 524,289.21 0.0% 11.01 ObfuscationBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
5,509,068.75 181.52 0.1% 61,830,240.13 19,977,540.00 3.095 3,020,308.24 0.4% 10.98 ReadBlockBench
924,286.56 1,081.92 0.3% 6,511,639.90 3,190,509.11 2.041 502,479.88 0.0% 10.97 ReadRawBlockBench

After (989537f, last commit in PR):

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
75,454.29 13,253.06 0.6% 655,366.68 277,944.85 2.358 131,074.08 0.0% 10.71 ObfuscationBench
ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
4,632,323.51 215.87 0.1% 55,587,644.66 16,818,777.18 3.305 2,552,825.55 0.4% 11.00 ReadBlockBench
105,903.19 9,442.59 0.1% 262,149.05 176,010.62 1.489 33,762.05 0.1% 11.00 ReadRawBlockBench

Comment on lines 29 to 52
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
Xor(target, rot_key, SIZE_BYTES);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

informational: Tried to optimize for aligned writes:

Suggested change
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
Xor(target, rot_key, SIZE_BYTES);
}
uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())};
Xor(target, rot_key, alignment_remaining);
target = target.subspan(alignment_remaining);
rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES];
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
}

2x speedup on Clang for ObfuscationBench. But no difference, or a few percent worse on GCC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing - while I was testing this out, I couldn't find a unit test that tested obfuscating a target buffer starting at different offsets in memory. Might be good to add one?

Copy link
Contributor Author
@l0rinc l0rinc May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
    if (!*this) return;
    const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
    std::cout << "key_offset_bytes % SIZE_BYTES = " << (key_offset_bytes % SIZE_BYTES) << std::endl;

which reveals:

key_offset_bytes % SIZE_BYTES = 4
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 7
key_offset_bytes % SIZE_BYTES = 4
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 1
key_offset_bytes % SIZE_BYTES = 6
key_offset_bytes % SIZE_BYTES = 1
key_offset_bytes % SIZE_BYTES = 7
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 3
key_offset_bytes % SIZE_BYTES = 3
key_offset_bytes % SIZE_BYTES = 0
key_offset_bytes % SIZE_BYTES = 5
key_offset_bytes % SIZE_BYTES = 2
...

I'll investigate the alignment and whether we can use SIMD in GCC as well (similarly to how Clang auto-detects it)

Copy link
Contributor Author
@l0rinc l0rinc May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the alignment part, seems dangerous - do you see an explanation in the assembly why it would be faster?
I tried reproducing it, but I'm getting a significant slowdown instead (ran each 3x):

Current loop:
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
    if (!*this) return;
    const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
    for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
        Xor(target, rot_key, SIZE_BYTES);
    }
    Xor(target, rot_key, target.size());
}
ns/MiB MiB/s err% total benchmark
14,543.90 68,757.33 0.1% 11.01 ObfuscationBench
14,604.52 68,471.95 0.0% 11.01 ObfuscationBench
14,573.71 68,616.72 0.2% 11.04 ObfuscationBench
Suggestion:
void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const
{
    if (!*this) return;
    uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
    const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())};
    Xor(target, rot_key, alignment_remaining);
    target = target.subspan(alignment_remaining);
    rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES];
    for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
        *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
    }
    Xor(target, rot_key, target.size());
}
ns/MiB MiB/s err% total benchmark
19,552.98 51,143.11 0.1% 11.01 ObfuscationBench
19,529.72 51,204.01 0.2% 11.00 ObfuscationBench
20,046.53 49,883.94 0.1% 11.03 ObfuscationBench

However, looking at the assembly closer, I reiterated why gcc doesn't do any SIMD and experimented with a few options and it seems we can help GCC by doing a big unrolled iteration before the whole thing:

for (constexpr auto unroll{8}; target.size() >= SIZE_BYTES * unroll; target = target.subspan(SIZE_BYTES * unroll)) {
    for (size_t i{0}; i < unroll; ++i) {
        Xor(target.subspan(i * SIZE_BYTES, SIZE_BYTES), rot_key, SIZE_BYTES);
    }
}

which helps GCC recognise that the loop is free of dependencies and performs a 128-bit SIMD pxor over 64-byte blocks instead of doing two 64-bit scalar xors per 16-byte stride - speeding up the benchmark from the mentioned ~19 GiB/s to ~27 GiB/s - i.e. additional 40% speedup. (Clang is unaffected)
I'll do an IBD again to see if it's significant enough to include it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you see an explanation in the assembly why it would be faster?

Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a co F438 mbination of objdump with a bunch of flags + searching in less for the consteval hex stuff ~9 months ago but suspect Obfuscation-methods get heavily inlined?

Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:

At tip of PR 989537f:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
73,504.93 13,604.53 0.1% 655,366.68 270,831.04 2.420 131,074.07 0.0% 10.66 ObfuscationBench

With diff from #31144 (comment) applied:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
37,501.52 26,665.59 0.9% 294,930.14 138,093.33 2.136 32,771.94 0.0% 11.03 ObfuscationBench

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

That only tests with an offset into the key, not an offset into the target, right?

I'm thinking something like:

diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 1985b24663..2e10792710 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
     for (size_t test{0}; test < 100; ++test) {
         const size_t write_size{1 + m_rng.randrange(100U)};
         const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around
+        const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))};
 
         std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{};
         if (m_rng.randbool()) m_rng.fillrand(key_bytes);
@@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
         std::vector expected{m_rng.randbytes<std::byte>(write_size)};
         std::vector actual{expected};
 
-        expected_xor(expected, key_bytes, key_offset);
-        obfuscation(actual, key_offset);
+        expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset);
+        obfuscation(std::span{actual}.subspan(write_offset), key_offset);
 
         BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end());
     }

It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't able to use the small Godbolt sample to divine the reason for my 2x Clang speedup yet.

Fell back to objdump -gSC build/bin/bench_bitcoin and clearly see:

            *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;
  1107c0:       f3 0f 6f 4c d1 f0       movdqu -0x10(%rcx,%rdx,8),%xmm1
  1107c6:       f3 0f 6f 14 d1          movdqu (%rcx,%rdx,8),%xmm2
  1107cb:       66 0f ef c8             pxor   %xmm0,%xmm1
  1107cf:       66 0f ef d0             pxor   %xmm0,%xmm2
  1107d3:       f3 0f 7f 4c d1 f0       movdqu %xmm1,-0x10(%rcx,%rdx,8)
  1107d9:       f3 0f 7f 14 d1          movdqu %xmm2,(%rcx,%rdx,8)
  1107de:       48 83 c2 04             add    $0x4,%rdx
  1107e2:       48 39 d0                cmp    %rdx,%rax
  1107e5:       75 d9                   jne    1107c0 <Obfuscation::operator()(std::span<std::byte, 18446744073709551615ul>, unsigned long) const+0xf0>

from my suggestion, close to the ObfuscationBench symbols. But I couldn't spot pxor/xmm close-by in the baseline dump.

Maybe too synthetic to bother with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me longer than anticipated, but I have measured different scenarios here to understand where alignment matters and where it doesn't, with gcc, clang and apple clang

GCC 14 is 38.5% faster with alignment & 64 byte unroll compared to just a 64 bit xor

------ C++ compiler .......................... GNU 14.2.0 ------
rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1\
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& git log -1 --pretty=format:'%h %s'

00d77f4 refactor: unify xor vs obfuscation nomenclature

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
879,957.05 1,136.42 0.0% 9,437,186.42 3,158,477.41 2.988 1,048,576.99 0.0% 5.50 ObfuscationBench
879,916.05 1,136.47 0.0% 9,437,186.42 3,157,995.25 2.988 1,048,576.99 0.0% 5.50 ObfuscationBench

2daaa83 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
51,665.69 19,355.21 0.0% 327,684.05 185,409.22 1.767 65,536.55 0.0% 5.50 ObfuscationBench
52,520.96 19,040.02 0.2% 327,684.05 188,497.60 1.738 65,536.55 0.0% 5.51 ObfuscationBench

c69f744 experiment: aligned xor only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
51,601.83 19,379.16 0.0% 327,687.85 185,151.71 1.770 65,536.95 0.0% 5.50 ObfuscationBench
51,952.63 19,248.30 0.1% 327,687.85 186,403.90 1.758 65,536.95 0.0% 5.50 ObfuscationBench

9472af6 experiment: unroll only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
46,341.13 21,579.10 0.0% 475,139.45 166,272.12 2.858 81,920.35 0.0% 5.50 ObfuscationBench
45,936.99 21,768.95 0.0% 475,139.45 164,853.63 2.882 81,920.35 0.0% 5.50 ObfuscationBench

e9fd88b experiment: unroll + alignment

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
37,281.01 26,823.31 0.1% 475,139.64 133,808.24 3.551 81,920.54 0.0% 5.35 ObfuscationBench
37,652.77 26,558.47 0.1% 475,139.64 135,132.04 3.516 81,920.54 0.0% 5.35 ObfuscationBench

LLVM Clang20 is 2.15x faster with alignment & 64 byte unroll compared to just a 64 bit xor

------ C++ compiler .......................... Clang 20.1.6 ------
rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \
&& git log -1 --pretty=format:'%h %s'

00d77f4 refactor: unify xor vs obfuscation nomenclature

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
926,379.31 1,079.47 0.1% 6,815,747.36 3,325,871.40 2.049 524,289.23 0.0% 5.50 ObfuscationBench
926,608.63 1,079.20 0.0% 6,815,747.36 3,326,572.88 2.049 524,289.23 0.0% 5.50 ObfuscationBench

2daaa83 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
74,817.84 13,365.80 0.0% 655,366.68 268,566.88 2.440 131,074.08 0.0% 5.50 ObfuscationBench
75,351.98 13,271.05 0.1% 655,366.68 270,405.84 2.424 131,074.08 0.0% 5.51 ObfuscationBench

c69f744 experiment: aligned xor only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
38,396.57 26,043.99 0.1% 294,929.64 137,732.80 2.141 32,771.94 0.0% 5.35 ObfuscationBench
38,151.47 26,211.31 0.2% 294,929.64 136,877.12 2.155 32,771.94 0.0% 5.38 ObfuscationBench

9472af6 experiment: unroll only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
35,556.53 28,124.23 0.1% 262,148.44 127,610.00 2.054 16,384.64 0.0% 5.32 ObfuscationBench
36,409.31 27,465.50 0.1% 262,148.44 130,650.29 2.006 16,384.64 0.0% 5.33 ObfuscationBench

e9fd88b experiment: unroll + alignment

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
34,770.15 28,760.30 0.0% 262,149.44 124,756.60 2.101 16,384.83 0.0% 5.32 ObfuscationBench
36,214.90 27,612.94 0.0% 262,149.44 129,932.72 2.018 16,384.84 0.0% 5.33 ObfuscationBench

Combining unrolling with aligning readjusts the speed for Apple Clang (which was already doing these automatically so there's no speed difference because of the latest commit) and brings Clang and GCC closer to the max bus speed.

Clang needed your alignment to achieve that, and GCC needed the unrolled 64-byte XOR to take advantage of SIMD.
Added this as a separate commit with you as coauthor, explaining the extra optimizations + updated godbolt and table from

Target & Compiler Stride (per iter) Main operation(s) in loop Effective XORs
Clang x86-64 (trunk) 32 bytes two unaligned 128-bit loads → pxor with broadcast key → stores 4 × 64-bit
GCC x86-64 (trunk) 16 bytes two scalar 64-bit xor reg-mem instructions directly on the target 2 × 64-bit
GCC RV32 (trunk) 8 bytes copy 8 B to stack scratch → XOR via two 32-bit regs → copy back 1 × 64-bit
GCC s390x (big-endian 14.2) 64 bytes eight XC ops using key cached on stack 8 × 64-bit

to:

Target & Compiler Stride (per hot-loop iter) Main operation(s) in loop Effective XORs / iter
Clang x86-64 (trunk) 64 bytes 4 × movdqu → pxor → store (aligned after peel) 8 × 64-bit
GCC x86-64 (trunk) 64 bytes same 4 × movdqu/pxor sequence, enabled by 8-way unroll 8 × 64-bit
GCC RV32 (trunk) 8 bytes copy 8 B to temp stack → 2 × 32-bit XOR → copy back 1 × 64-bit (as 2 × 32-bit)
GCC s390x (big-endian 14.2) 64 bytes 8 × XC (mem-mem 8-B XOR) with key cached on stack 8 × 64-bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a build failure on 32 bit systems because of the alignment changes - fixed it by moving the std::assume_aligned check inside the alignment condition itself.
Remeasured the benchmarks (all 3 compilers are the same as the previous push) and redid the endianness check (as described in the PR) and updated the https://godbolt.org/z/35nveanf5 again.

Comment on lines 29 to 52
const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
Xor(target, rot_key, SIZE_BYTES);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you see an explanation in the assembly why it would be faster?

Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a combination of objdump with a bunch of flags + searching in less for the consteval hex stuff ~9 months ago but suspect Obfuscation-methods get heavily inlined?

Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:

At tip of PR 989537f:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
73,504.93 13,604.53 0.1% 655,366.68 270,831.04 2.420 131,074.07 0.0% 10.66 ObfuscationBench

With diff from #31144 (comment) applied:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark
37,501.52 26,665.59 0.9% 294,930.14 138,093.33 2.136 32,771.94 0.0% 11.03 ObfuscationBench

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

That only tests with an offset into the key, not an offset into the target, right?

I'm thinking something like:

diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp
index 1985b24663..2e10792710 100644
--- a/src/test/streams_tests.cpp
+++ b/src/test/streams_tests.cpp
@@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
     for (size_t test{0}; test < 100; ++test) {
         const size_t write_size{1 + m_rng.randrange(100U)};
         const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around
+        const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))};
 
         std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{};
         if (m_rng.randbool()) m_rng.fillrand(key_bytes);
@@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference)
         std::vector expected{m_rng.randbytes<std::byte>(write_size)};
         std::vector actual{expected};
 
-        expected_xor(expected, key_bytes, key_offset);
-        obfuscation(actual, key_offset);
+        expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset);
+        obfuscation(std::span{actual}.subspan(write_offset), key_offset);
 
         BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end());
     }

It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.

l0rinc and others added 5 commits May 23, 2025 15:48
Since production code only uses keys of length 8, we're not testing with other values anymore

Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
Since a previous PR already solved the tiny byte-array xors during serialization, we're only concentrating on big continuous chunks now.

> C++ compiler .......................... GNU 14.2.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          879,957.05 |            1,136.42 |    0.0% |    9,437,186.42 |    3,158,477.41 |  2.988 |   1,048,576.99 |    0.0% |      5.50 | `ObfuscationBench`

> C++ compiler .......................... Clang 20.1.6

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          926,379.31 |            1,079.47 |    0.1% |    6,815,747.36 |    3,325,871.40 |  2.049 |     524,289.23 |    0.0% |      5.50 | `ObfuscationBench`
This simplifies the next commits.
Since `CDBWrapper::Read` still supports vector only, we can initialize `m_obfuscation` directly instead of using a separate helper.
`CreateObfuscation` was also inlined, replaced `key_exists` with `key_missing`, and simplified the `if` condition that writes a new obfuscation key.
Since we're already throwing via `HandleError` here, the obfuscation key is also validated by throwing if invalid.
10000
@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch from 989537f to 5627cd4 Compare May 23, 2025 13:49
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task MSan, depends: https://github.com/bitcoin/bitcoin/runs/42787791556
LLM reason (✨ experimental): The CI failure is due to a failed assertion related to memory alignment during streams_tests.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc and others added 2 commits May 23, 2025 17:00
…yte>` to `uint64_t`

Since `util::Xor` now operates on `uint64_t` keys, migrated the obfuscation key end‑to‑end - from use sites in util::Xor up through streams, LevelDB wrappers, and disk (de)serialization - replacing all former `std::vector<std::byte>` keys with `uint64_t` (we still serialize them as vectors but convert immediately to `uint64_t` on load). This is why tests still generate vector keys and convert them to `uint64_t` later instead of generating them directly.

We also short‑circuit `Xor` calls when the key is zero to avoid unnecessary calculations (e.g., `MakeWritableByteSpan`).

In `Obfuscation::Unserialize` we can safely throw an `std::logic_error` since during mempool fuzzing `mempool_persist.cpp#L141` catches and ignored these errors safely.

>  C++ compiler .......................... GNU 14.2.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           51,665.69 |           19,355.21 |    0.0% |      327,684.05 |      185,409.22 |  1.767 |      65,536.55 |    0.0% |      5.50 | `ObfuscationBench`

> C++ compiler .......................... Clang 20.1.6

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           74,817.84 |           13,365.80 |    0.0% |      655,366.68 |      268,566.88 |  2.440 |     131,074.08 |    0.0% |      5.50 | `ObfuscationBench`

Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
Benchmarks indicated that processing 8-byte words instead of bytes already gives an order of magnitude speed-up, but:
* GCC still emitted scalar code;
* Clang’s auto-vectorised loop ran on the slow unaligned-load path.

Fix contains:
* peeling the mis-aligned head enabled the hot loop starting at an 8-byte address;
* `std::assume_aligned<8>` tells the optimiser the promise holds - required to keep Apple Clang happy;
* manually unrolling the body to 64 bytes enabled GCC to auto-vectorise.

>  C++ compiler .......................... GNU 14.2.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           37,281.01 |           26,823.31 |    0.1% |      475,139.64 |      133,808.24 |  3.551 |      81,920.54 |    0.0% |      5.35 | `ObfuscationBench`

> C++ compiler .......................... Clang 20.1.6

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           34,770.15 |           28,760.30 |    0.0% |      262,149.44 |      124,756.60 |  2.101 |      16,384.83 |    0.0% |      5.32 | `ObfuscationBench`

Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
@l0rinc l0rinc force-pushed the l0rinc/optimize-xor branch from 5627cd4 to 63854e8 Compare May 23, 2025 15:15
Copy link
Contributor
@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-ACK 63854e8

While the code is made less straightforward due to optimizations, I think it's still worth doing it to minimize overhead from obfuscation.

Changes since first ACK (#31144 (review)):

  • Changed commit messages to include benchmark results from more recent GCC and newer vanilla instead of Apple-flavoured Clang.
  • Modified unit test to add verification of obfuscation starting at non-aligned offsets as suggested.
  • Renamed benchmark file as suggested.
  • Removed ReadBlockBench and ReadRawBlockBench results from commit message (verified to still improve at the end of this review).
  • Added new commit at the end with further optimization partially inspired by my suggestion ([IBD] multi-byte block obfuscation #31144 (comment))!
    • Adapts code to help compiler with loop unrolling and SIMD.
    • Instead of having *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key, the unrolling allows for using Obfuscation::Xor() with memcpys back & forth and still have compilers optimize effectively.
    • I like how this version calculates the number of initial non-aligned bytes.
    • std::assume_aligned - nice find!

Testing

Guix + Windows

Was curious about Windows performance, so I patched the Guix build to include bench_bitcoin.exe.

Patch
diff --git a/contrib/guix/libexec/build.sh b/contrib/guix/libexec/build.sh
index ae98eba744..25191f7e24 100755
--- a/contrib/guix/libexec/build.sh
+++ b/contrib/guix/libexec/build.sh
@@ -206,7 +206,7 @@ mkdir -p "$OUTDIR"
 ###########################
 
 # CONFIGFLAGS
-CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=OFF -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
+CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=ON -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
 
 # CFLAGS
 HOST_CFLAGS="-O2 -g"
diff --git a/src/bench/CMakeLists.txt b/src/bench/CMakeLists.txt
index 3003115068..56f78ab21d 100644
--- a/src/bench/CMakeLists.txt
+++ b/src/bench/CMakeLists.txt
@@ -58,6 +58,8 @@ target_raw_data_sources(bench_bitcoin NAMESPACE benchmark::data
   data/block413567.raw
 )
 
+add_windows_application_manifest(bench_bitcoin)
+
 target_link_libraries(bench_bitcoin
   core_interface
   test_util

Result on Windows Guix build show we are in the same ballpark as results in commit message (63854e8):

> bench_bitcoin.exe -filter=ObfuscationBench -min-time=10000

|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           37,787.80 |           26,463.57 |    0.7% |     11.22 | `ObfuscationBench`

Result on Linux Guix build is similar as Windows, also those in commit messages and measurements with my suggested optimization:

₿ ./bench_bitcoin -filter=ObfuscationBench -min-time=10000

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           36,053.83 |           27,736.30 |    0.4% |      475,140.24 |      132,841.83 |  3.577 |      81,920.74 |    0.0% |     10.96 | `ObfuscationBench`

Prior to running doas perf system tune, it actually ran significantly faster (probably due to some unreliable CPU boosting logic):

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           29,425.58 |           33,984.04 |    0.8% |      475,140.23 |      107,611.98 |  4.415 |      81,920.73 |    0.0% |     11.04 | `ObfuscationBench`

ReadBlockBench + ReadRawBlockBench

Verified that related benchmarks still improved (non-Guix, NixOS, compiler Clang 20.1.3):

Before (2df824f, PR base):

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        5,440,481.19 |              183.81 |    0.1% |   61,834,466.65 |   19,827,449.24 |  3.119 |   3,028,151.40 |    0.4% |     10.98 | `ReadBlockBench`
|          919,632.97 |            1,087.39 |    0.0% |    6,511,624.89 |    3,187,581.46 |  2.043 |     502,474.87 |    0.0% |     11.00 | `ReadRawBlockBench`

After (63854e8, PR tip):

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        4,649,147.57 |              215.09 |    0.1% |   55,576,004.70 |   16,919,942.23 |  3.285 |   2,542,138.60 |    0.5% |     11.03 | `ReadBlockBench`
|           87,733.01 |           11,398.22 |    0.2% |      262,263.03 |      116,898.39 |  2.244 |      18,154.03 |    0.1% |     10.80 | `ReadRawBlockBench`

@yuvicc
Copy link
Contributor
yuvicc commented Jun 9, 2025

Concept ACK

Will benchmark this soon!

Copy link
Contributor
@yuvicc yuvicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-Review & Tested ACK 63854e8

The code lgtm, thanks to @hodlinator reviews.

Regarding Tests

  • x86_64 ubuntu 24.04.2
C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          107,903.53 |            9,267.54 |    0.1% |      475,140.31 |      452,373.92 |  1.050 |      81,920.81 |    0.0% |     10.99 | `ObfuscationBench`


C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          107,881.57 |            9,269.42 |    0.1% |      475,140.31 |      452,243.73 |  1.051 |      81,920.81 |    0.0% |     11.00 | `ObfuscationBench`

  • MacOS Sequoia 15.3.2
C++ compiler ................................................................... Clang 16.0.0
|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           22,649.25 |           44,151.57 |    4.7% |     10.80 | `ObfuscationBench`


C++ compiler ................................................................... Clang 18.1.7
|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           21,154.05 |           47,272.27 |    0.1% |     10.99 | `ObfuscationBench`

commit 3959a90

C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          117,618.83 |            8,502.04 |    0.1% |      655,362.22 |      494,070.94 |  1.326 |     131,072.42 |    0.0% |      5.51 | `ObfuscationBench`

C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          113,280.61 |            8,827.64 |    0.1% |      327,683.32 |      475,761.88 |  0.689 |      65,536.51 |    0.0% |      5.50 | `ObfuscationBench`

Commit be3435c

C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          788,559.17 |            1,268.14 |    0.1% |    6,815,747.22 |    3,366,204.35 |  2.025 |     524,289.09 |    0.0% |      5.50 | `ObfuscationBench`

C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          604,312.84 |            1,654.77 |    0.1% |    9,437,186.04 |    2,579,119.63 |  3.659 |   1,048,576.71 |    0.0% |      5.51 | `ObfuscationBench`

I will be doing IBD test on my machine and share the results soon!

@maflcko
Copy link
Member
maflcko commented Jun 16, 2025

For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144

I don't think this list is complete. At least for me it is truncated, due to the benchmark rename in the early commit. I tried to fix it in corecheck/corecheck#90, but I am not sure how to re-run it here. Maybe rebase?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0