[IBD] multi-byte block obfuscation #31144

l0rinc · 2024-10-24T07:11:11Z

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

Summary

Current block obfuscations are done byte-by-byte, this PR batches them to 64 bit primitives to speed up obfuscating bigger memory batches.
This is especially relevant now that #31551 was merged, having bigger obfuscatable chunks.

Since this obfuscation is optional, the speedup measured here depends on whether it's a random value or completely turned off (i.e. XOR-ing with 0).

Changes in testing, benchmarking and implementation

Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
Migrated std::vector<std::byte>(8) keys to plain uint64_t;
Process unaligned bytes separately and unroll body to 64 bytes.

Assembly

Memory alignment is enforced by a small peel-loop (std::memcpy is optimized out on tested platform), with an std::assume_aligned<8> check, see the Godbolt listing at https://godbolt.org/z/35nveanf5 for details

Details

Target & Compiler	Stride (per hot-loop iter)	Main operation(s) in loop	Effective XORs / iter
Clang x86-64 (trunk)	64 bytes	4 × movdqu → pxor → store	8 × 64-bit
GCC x86-64 (trunk)	64 bytes	4 × movdqu/pxor sequence, enabled by 8-way unroll	8 × 64-bit
GCC RV32 (trunk)	8 bytes	copy 8 B to temp → 2 × 32-bit XOR → copy back	1 × 64-bit (as 2 × 32-bit)
GCC s390x (big-endian 14.2)	64 bytes	8 × XC (mem-mem 8-B XOR) with key cached on stack	8 × 64-bit

Endianness

The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption.
Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order.
Since CI lacks a big-endian machine, testing was done locally via Docker.

Details

brew install podman pigz
softwareupdate --install-rosetta
podman machine init
podman machine start
docker run --platform linux/s390x -it ubuntu:latest /bin/bash
  apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev python3 && \
  cd /mnt && git clone --depth=1 https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
  cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
  ./build/bin/test_bitcoin --run_test=streams_tests

Measurements (micro benchmarks and full IBDs)

cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc/clang -DCMAKE_CXX_COMPILER=g++/clang++ &&
cmake --build build -j$(nproc) &&
build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000

Clang 20.1.6

Before:

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
926,379.31	1,079.47	0.1%	6,815,747.36	3,325,871.40	2.049	524,289.23	0.0%	5.50	`ObfuscationBench`

After:

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
74,817.84	13,365.80	0.0%	655,366.68	268,566.88	2.440	131,074.08	0.0%	5.50	`ObfuscationBench`

and

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
34,770.15	28,760.30	0.0%	262,149.44	124,756.60	2.101	16,384.83	0.0%	5.32	`ObfuscationBench`

GNU 14.2.0

Before:

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
879,957.05	1,136.42	0.0%	9,437,186.42	3,158,477.41	2.988	1,048,576.99	0.0%	5.50	`ObfuscationBench`

After:

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
51,665.69	19,355.21	0.0%	327,684.05	185,409.22	1.767	65,536.55	0.0%	5.50	`ObfuscationBench`

and

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
37,281.01	26,823.31	0.1%	475,139.64	133,808.24	3.551	81,920.54	0.0%	5.35	`ObfuscationBench`

i.e. 26.6x faster obfuscation with GCC, 23.6x faster with Clang

For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144

Running an IBD until 888888 blocks reveals a 4% speedup.

Details

SSD:

COMMITS="8324a00bd4a6a5291c841f2d01162d8a014ddb02 5ddfd31b4158a89b0007cfb2be970c03d9278525"; \
STOP_HEIGHT=888888; DBCACHE=1000; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && \
    cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

8324a00 test: Compare util::Xor with randomized inputs against simple impl
5ddfd31 optimization: Xor 64 bits together instead of byte-by-byte

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00bd4a6a5291c841f2d01162d8a014ddb02)
  Time (abs ≡):        25033.413 s               [User: 33953.984 s, System: 2613.604 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31b4158a89b0007cfb2be970c03d9278525)
  Time (abs ≡):        24110.710 s               [User: 33389.536 s, System: 2660.292 s]

Relative speed comparison
        1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 8324a00bd4a6a5291c841f2d01162d8a014ddb02)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=1000 -blocksonly -printtoconsole=0 (COMMIT = 5ddfd31b4158a89b0007cfb2be970c03d9278525)

HDD:

COMMITS="71eb6eaa740ad0b28737e90e59b89a8e951d90d9 46854038e7984b599d25640de26d4680e62caba7"; \
STOP_HEIGHT=888888; DBCACHE=4500; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done) && \
hyperfine \
  --sort 'command' \
  --runs 2 \
  --export-json "$BASE_DIR/ibd-${COMMITS// /-}-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 100" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -blocksonly -printtoconsole=0"

71eb6ea test: compare util::Xor with randomized inputs against simple impl
4685403 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
  Time (mean ± σ):     37676.293 s ± 83.100 s    [User: 36900.535 s, System: 2220.382 s]
  Range (min … max):   37617.533 s … 37735.053 s    2 runs

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 46854038e7984b599d25640de26d4680e62caba7)
  Time (mean ± σ):     36181.287 s ± 195.248 s    [User: 34962.822 s, System: 1988.614 s]
  Range (min … max):   36043.226 s … 36319.349 s    2 runs

Relative speed comparison
        1.04 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 71eb6eaa740ad0b28737e90e59b89a8e951d90d9)
        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 46854038e7984b599d25640de26d4680e62caba7)

DrahtBot · 2024-10-24T07:11:15Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31144.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	hodlinator, yuvicc
Concept ACK	ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
#29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
#29307 (util: explicitly close all AutoFiles that have been written by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

src/streams.h

maflcko · 2024-10-24T08:27:45Z

I think your example may be a bit skewed? It shows how much time is spent when deserializing a CScript from a block file. However, block files contain full blocks, where many (most?) of the writes are single bytes (or 4 bytes), see #30833 (comment). Thus, it would be useful to know what the overall end-to-end performance difference is. Also taking into account the utxo db.

If you want the micro-benchmark to be representative, I'd presume you'd have to mimic the histogram of the sizes of writes. Just picking one (1024, or 1004), which is never hit in reality, and then optimizing for that may be misleading.

l0rinc · 2024-10-24T10:02:58Z

where many (most?) of the writes are single bytes (or 4 bytes)

Thanks, I've extended your previous benchmarks with both Autofile serialization and very small vectors.
I will also run a reindex of 400k blocks before and after to see if the effect is measurable or not.

src/streams.h

maflcko · 2024-10-24T19:14:18Z

It would be good to explain the jpegs in the description, or even remove them. They will be excluded from the merge commit and aren't shown, unless GitHub happens to be reachable and online. Are they saying that IBD was 4% faster? Also, I think they were created with the UB version of this pull, so may be outdated either way?

I did a quick check on my laptop and it seems the XorSmall (1+4 bytes) is slower with this pull. The Xor (modified to check 40 bytes) was twice as fast. Overall, I'd expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

I can try to bench on another machine later, to see if it makes a difference.

Can you clarify what type of machine you tested this on?

l0rinc · 2024-10-24T20:39:31Z

Are they saying that IBD was 4% faster?

That's what I'm measuring currently, but I don't expect more than 2% difference here.

Also, I think they were created with the UB version of this pull, so may be outdated either way?

Benchmarks indicated that the 64 bit compiled result was basically the same.

Overall, I'd expect it to be slower on my machine, due to the histogram of real data showing more small byte writes than long ones, IIRC.

I'll investigate, thanks.

Details

Posting the perf here for reference:

Reindexing until 300k blocks reveals that XOR usage was reduced:

hodlinator

Concept ACK a3dc138

Operating on CPU words rather than individual bytes. 👍

Not entirely clear to me from #31144 (comment) whether the optimizer is able to use SIMD. Guess picking through the binary of a GUIX-build would give a definitive answer.

The verbosity of std::memcpy hurts readability but alignment issues are real.

src/test/streams_tests.cpp

DrahtBot · 2024-10-26T21:58:16Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/32109998785}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc · 2025-04-26T14:57:43Z

Restored the deserialization validation in Obfuscation::Unserialize, we can't assert, but we can throw a std::logic_error, since during mempool fuzzing https://github.com/bitcoin/bitcoin/blob/master/src/node/mempool_persist.cpp#L141 catches and ignored these errors safely (managed to reproduce it on Linux, not sure why it's not reproducible on Mac).
I've also split out all renames to a single commit before any other refactor or optimization to simplify the higher-risk changes.
PR is ready for review again!

hodlinator

Code review bba6473.

Thanks for improving the dbwrapper test case and minimizing _hex_v usage in the end state of the PR!

src/dbwrapper.h

src/obfuscation.h

src/test/dbwrapper_tests.cpp

src/bench/xor.cpp

src/dbwrapper.cpp

src/streams.h

src/test/streams_tests.cpp

l0rinc · 2025-05-09T15:07:18Z

Rebased, opened issue for unrelated test failure

hodlinator

Code review 9406c17

Getting pretty close to ACK. One main inlined question about error handling in dbwrapper.cpp.

hodlinator · 2025-05-19T07:53:50Z

src/dbwrapper.cpp

+            LogInfo("Using obfuscation key for %s: %s", fs::PathToString(params.path), HexStr(obfuscation_key_bytes));
+            m_obfuscation = MakeByteSpan(obfuscation_key_bytes).first<Obfuscation::SIZE_BYTES>();
+        } else {
+            LogError("Invalid obfuscation key for %s: %s - continuing unobfuscated!", fs::PathToString(params.path), HexStr(obfuscation_key_bytes));


Should we really continue after reading an invalid key length? Feels like it would be good to throw an exception, or assert the condition instead of if.

(Could adjust the message/comment to include the possibility of us being old software running on some future data format we don't recognize?)

I'm against throwing in general - especially for constructors, but we're already reading files in there and already throwing via HandleError - changed and moved to the related refactoring commit.

hodlinator · 2025-05-19T08:04:37Z

src/node/mempool_persist.cpp

        } else if (version == MEMPOOL_DUMP_VERSION) {
+            std::vector<std::byte> obfuscation{8};


note in b085c52:
Seems like there's a newish footgun with curly-brace initialization of vectors.

This compiles because the passed value unambiguously specifies the length (std::byte isn't implicitly constructible from integer literals):

std::vector<std::byte> obfuscation{512};

The following fails, because it's interpreted as initializer-list construction, and the integer literal would be narrowed to fit into the first element of the vector:

std::vector<uint8_t> obfuscation{512};

So your code is correct, but I'm actually starting to feel some anti-curly sentiment brewing for vector initialization.

✨ ₘₐ𝓖ᵢ𝒸 ✨

src/test/streams_tests.cpp

src/obfuscation.h

src/dbwrapper.h

hodlinator

ACK 989537f

Optimizing by operating on whole-word units instead of byte-for-byte makes for more efficient use of common hardware. Extracting the optimization into the Obfuscation abstraction which caches whole-word "rotations" (the XOR-key pre-rotated at all possible byte-offsets) makes sense.

Clear speedup in benchmarks. Even if the least improved of those measured (ReadBlockBench) would be the most representative of real-world use, it is sped up by ~15-16%.

Benchmarks

ObfuscationBench: 12x speedup on Clang, 17x speedup on GCC.
ReadBlockBench: 15-16% speedup on both.
ReadRawBlockBench: ~9x speedup on both.

Details

Used ./build/bin/bench_bitcoin --filter="ObfuscationBench|ReadBlockBench|ReadRawBlockBench" --min-time=10000 on NixOS.

GCC 14.2.1

Before (06353e2, 2nd commit in PR):

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
861,980.77	1,160.12	0.2%	9,437,186.38	3,177,041.69	2.970	1,048,576.97	0.0%	10.79	`ObfuscationBench`

ns/op	op/s	err%	ins/op	cyc/op	IPC	bra/op	miss%	total	benchmark
5,509,795.20	181.49	0.5%	64,457,565.88	20,006,289.95	3.222	3,636,177.14	0.3%	11.03	`ReadBlockBench`
936,745.22	1,067.53	0.3%	9,010,771.90	3,232,301.21	2.788	1,002,214.88	0.0%	10.99	`ReadRawBlockBench`

After (989537f, last commit in PR):

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
51,720.05	19,334.86	0.4%	327,684.05	190,464.16	1.720	65,536.55	0.0%	10.99	`ObfuscationBench`

ns/op	op/s	err%	ins/op	cyc/op	IPC	bra/op	miss%	total	benchmark
4,675,373.37	213.89	0.3%	55,774,507.70	16,981,281.76	3.284	2,699,256.61	0.5%	10.98	`ReadBlockBench`
105,697.36	9,460.97	0.2%	324,342.05	175,063.45	1.853	64,840.05	0.0%	10.99	`ReadRawBlockBench`

Clang Clang 20.1.3

Before (06353e2, 2nd commit in PR):

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
907,086.11	1,102.43	0.1%	6,815,747.33	3,342,513.78	2.039	524,289.21	0.0%	11.01	`ObfuscationBench`

ns/op	op/s	err%	ins/op	cyc/op	IPC	bra/op	miss%	total	benchmark
5,509,068.75	181.52	0.1%	61,830,240.13	19,977,540.00	3.095	3,020,308.24	0.4%	10.98	`ReadBlockBench`
924,286.56	1,081.92	0.3%	6,511,639.90	3,190,509.11	2.041	502,479.88	0.0%	10.97	`ReadRawBlockBench`

After (989537f, last commit in PR):

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
75,454.29	13,253.06	0.6%	655,366.68	277,944.85	2.358	131,074.08	0.0%	10.71	`ObfuscationBench`

ns/op	op/s	err%	ins/op	cyc/op	IPC	bra/op	miss%	total	benchmark
4,632,323.51	215.87	0.1%	55,587,644.66	16,818,777.18	3.305	2,552,825.55	0.4%	11.00	`ReadBlockBench`
105,903.19	9,442.59	0.1%	262,149.05	176,010.62	1.489	33,762.05	0.1%	11.00	`ReadRawBlockBench`

src/obfuscation.h

hodlinator · May 20, 2025

src/obfuscation.h

+        const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
+        for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
+            Xor(target, rot_key, SIZE_BYTES);
+        }


informational: Tried to optimize for aligned writes:

Suggested change

const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off

for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time

Xor(target, rot_key, SIZE_BYTES);

}

uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off

const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())};

Xor(target, rot_key, alignment_remaining);

target = target.subspan(alignment_remaining);

rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES];

for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time

*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key;

}

2x speedup on Clang for ObfuscationBench. But no difference, or a few percent worse on GCC.

One more thing - while I was testing this out, I couldn't find a unit test that tested obfuscating a target buffer starting at different offsets in memory. Might be good to add one?

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const { if (!*this) return; const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off std::cout << "key_offset_bytes % SIZE_BYTES = " << (key_offset_bytes % SIZE_BYTES) << std::endl;

which reveals:

key_offset_bytes % SIZE_BYTES = 4 key_offset_bytes % SIZE_BYTES = 0 key_offset_bytes % SIZE_BYTES = 7 key_offset_bytes % SIZE_BYTES = 4 key_offset_bytes % SIZE_BYTES = 0 key_offset_bytes % SIZE_BYTES = 1 key_offset_bytes % SIZE_BYTES = 6 key_offset_bytes % SIZE_BYTES = 1 key_offset_bytes % SIZE_BYTES = 7 key_offset_bytes % SIZE_BYTES = 0 key_offset_bytes % SIZE_BYTES = 3 key_offset_bytes % SIZE_BYTES = 3 key_offset_bytes % SIZE_BYTES = 0 key_offset_bytes % SIZE_BYTES = 5 key_offset_bytes % SIZE_BYTES = 2 ...

I'll investigate the alignment and whether we can use SIMD in GCC as well (similarly to how Clang auto-detects it)

I'm not sure about the alignment part, seems dangerous - do you see an explanation in the assembly why it would be faster?
I tried reproducing it, but I'm getting a significant slowdown instead (ran each 3x):

Current loop:

void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const { if (!*this) return; const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time Xor(target, rot_key, SIZE_BYTES); } Xor(target, rot_key, target.size()); }

ns/MiB MiB/s err% total benchmark

14,543.90 68,757.33 0.1% 11.01 ObfuscationBench

14,604.52 68,471.95 0.0% 11.01 ObfuscationBench

14,573.71 68,616.72 0.2% 11.04 ObfuscationBench

Suggestion:

void operator()(std::span<std::byte> target, const size_t key_offset_bytes = 0) const { if (!*this) return; uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off const size_t alignment_remaining{std::min(SIZE_BYTES - (reinterpret_cast<ptrdiff_t>(target.data()) % SIZE_BYTES), target.size())}; Xor(target, rot_key, alignment_remaining); target = target.subspan(alignment_remaining); rot_key = m_rotations[(key_offset_bytes + alignment_remaining) % SIZE_BYTES]; for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key; } Xor(target, rot_key, target.size()); }

ns/MiB MiB/s err% total benchmark

19,552.98 51,143.11 0.1% 11.01 ObfuscationBench

19,529.72 51,204.01 0.2% 11.00 ObfuscationBench

20,046.53 49,883.94 0.1% 11.03 ObfuscationBench

However, looking at the assembly closer, I reiterated why gcc doesn't do any SIMD and experimented with a few options and it seems we can help GCC by doing a big unrolled iteration before the whole thing:

for (constexpr auto unroll{8}; target.size() >= SIZE_BYTES * unroll; target = target.subspan(SIZE_BYTES * unroll)) { for (size_t i{0}; i < unroll; ++i) { Xor(target.subspan(i * SIZE_BYTES, SIZE_BYTES), rot_key, SIZE_BYTES); } }

which helps GCC recognise that the loop is free of dependencies and performs a 128-bit SIMD pxor over 64-byte blocks instead of doing two 64-bit scalar xors per 16-byte stride - speeding up the benchmark from the mentioned ~19 GiB/s to ~27 GiB/s - i.e. additional 40% speedup. (Clang is unaffected)
I'll do an IBD again to see if it's significant enough to include it here.

do you see an explanation in the assembly why it would be faster?

Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a co F438 mbination of objdump with a bunch of flags + searching in less for the consteval hex stuff ~9 months ago but suspect Obfuscation-methods get heavily inlined?

Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:

At tip of PR 989537f:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

73,504.93 13,604.53 0.1% 655,366.68 270,831.04 2.420 131,074.07 0.0% 10.66 ObfuscationBench

With diff from #31144 (comment) applied:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

37,501.52 26,665.59 0.9% 294,930.14 138,093.33 2.136 32,771.94 0.0% 11.03 ObfuscationBench

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

That only tests with an offset into the key, not an offset into the target, right?

I'm thinking something like:

diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp index 1985b24663..2e10792710 100644 --- a/src/test/streams_tests.cpp +++ b/src/test/streams_tests.cpp @@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference) for (size_t test{0}; test < 100; ++test) { const size_t write_size{1 + m_rng.randrange(100U)}; const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around + const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))}; std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{}; if (m_rng.randbool()) m_rng.fillrand(key_bytes); @@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference) std::vector expected{m_rng.randbytes<std::byte>(write_size)}; std::vector actual{expected}; - expected_xor(expected, key_bytes, key_offset); - obfuscation(actual, key_offset); + expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset); + obfuscation(std::span{actual}.subspan(write_offset), key_offset); BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end()); }

It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.

Wasn't able to use the small Godbolt sample to divine the reason for my 2x Clang speedup yet.

Fell back to objdump -gSC build/bin/bench_bitcoin and clearly see:

*reinterpret_cast<uint64_t*>(target.data()) ^= rot_key; 1107c0: f3 0f 6f 4c d1 f0 movdqu -0x10(%rcx,%rdx,8),%xmm1 1107c6: f3 0f 6f 14 d1 movdqu (%rcx,%rdx,8),%xmm2 1107cb: 66 0f ef c8 pxor %xmm0,%xmm1 1107cf: 66 0f ef d0 pxor %xmm0,%xmm2 1107d3: f3 0f 7f 4c d1 f0 movdqu %xmm1,-0x10(%rcx,%rdx,8) 1107d9: f3 0f 7f 14 d1 movdqu %xmm2,(%rcx,%rdx,8) 1107de: 48 83 c2 04 add $0x4,%rdx 1107e2: 48 39 d0 cmp %rdx,%rax 1107e5: 75 d9 jne 1107c0 <Obfuscation::operator()(std::span<std::byte, 18446744073709551615ul>, unsigned long) const+0xf0>

from my suggestion, close to the ObfuscationBench symbols. But I couldn't spot pxor/xmm close-by in the baseline dump.

Maybe too synthetic to bother with.

Took me longer than anticipated, but I have measured different scenarios here to understand where alignment matters and where it doesn't, with gcc, clang and apple clang

GCC 14 is 38.5% faster with alignment & 64 byte unroll compared to just a 64 bit xor

------ C++ compiler .......................... GNU 14.2.0 ------

rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1\ && build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \ && build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \ && git log -1 --pretty=format:'%h %s'

00d77f4 refactor: unify xor vs obfuscation nomenclature

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

879,957.05 1,136.42 0.0% 9,437,186.42 3,158,477.41 2.988 1,048,576.99 0.0% 5.50 ObfuscationBench

879,916.05 1,136.47 0.0% 9,437,186.42 3,157,995.25 2.988 1,048,576.99 0.0% 5.50 ObfuscationBench

2daaa83 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

51,665.69 19,355.21 0.0% 327,684.05 185,409.22 1.767 65,536.55 0.0% 5.50 ObfuscationBench

52,520.96 19,040.02 0.2% 327,684.05 188,497.60 1.738 65,536.55 0.0% 5.51 ObfuscationBench

c69f744 experiment: aligned xor only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

51,601.83 19,379.16 0.0% 327,687.85 185,151.71 1.770 65,536.95 0.0% 5.50 ObfuscationBench

51,952.63 19,248.30 0.1% 327,687.85 186,403.90 1.758 65,536.95 0.0% 5.50 ObfuscationBench

9472af6 experiment: unroll only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

46,341.13 21,579.10 0.0% 475,139.45 166,272.12 2.858 81,920.35 0.0% 5.50 ObfuscationBench

45,936.99 21,768.95 0.0% 475,139.45 164,853.63 2.882 81,920.35 0.0% 5.50 ObfuscationBench

e9fd88b experiment: unroll + alignment

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

37,281.01 26,823.31 0.1% 475,139.64 133,808.24 3.551 81,920.54 0.0% 5.35 ObfuscationBench

37,652.77 26,558.47 0.1% 475,139.64 135,132.04 3.516 81,920.54 0.0% 5.35 ObfuscationBench

LLVM Clang20 is 2.15x faster with alignment & 64 byte unroll compared to just a 64 bit xor

------ C++ compiler .......................... Clang 20.1.6 ------

rm -rf build && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ >/dev/null 2>&1 && cmake --build build -j$(nproc) >/dev/null 2>&1 \ && build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \ && build/bin/bench_bitcoin -filter='ObfuscationBench' -min-time=5000 \ && git log -1 --pretty=format:'%h %s'

00d77f4 refactor: unify xor vs obfuscation nomenclature

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

926,379.31 1,079.47 0.1% 6,815,747.36 3,325,871.40 2.049 524,289.23 0.0% 5.50 ObfuscationBench

926,608.63 1,079.20 0.0% 6,815,747.36 3,326,572.88 2.049 524,289.23 0.0% 5.50 ObfuscationBench

2daaa83 optimization: migrate fixed-size obfuscation from std::vector<std::byte> to uint64_t

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

74,817.84 13,365.80 0.0% 655,366.68 268,566.88 2.440 131,074.08 0.0% 5.50 ObfuscationBench

75,351.98 13,271.05 0.1% 655,366.68 270,405.84 2.424 131,074.08 0.0% 5.51 ObfuscationBench

c69f744 experiment: aligned xor only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

38,396.57 26,043.99 0.1% 294,929.64 137,732.80 2.141 32,771.94 0.0% 5.35 ObfuscationBench

38,151.47 26,211.31 0.2% 294,929.64 136,877.12 2.155 32,771.94 0.0% 5.38 ObfuscationBench

9472af6 experiment: unroll only

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

35,556.53 28,124.23 0.1% 262,148.44 127,610.00 2.054 16,384.64 0.0% 5.32 ObfuscationBench

36,409.31 27,465.50 0.1% 262,148.44 130,650.29 2.006 16,384.64 0.0% 5.33 ObfuscationBench

e9fd88b experiment: unroll + alignment

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

34,770.15 28,760.30 0.0% 262,149.44 124,756.60 2.101 16,384.83 0.0% 5.32 ObfuscationBench

36,214.90 27,612.94 0.0% 262,149.44 129,932.72 2.018 16,384.84 0.0% 5.33 ObfuscationBench

Combining unrolling with aligning readjusts the speed for Apple Clang (which was already doing these automatically so there's no speed difference because of the latest commit) and brings Clang and GCC closer to the max bus speed.

Clang needed your alignment to achieve that, and GCC needed the unrolled 64-byte XOR to take advantage of SIMD.
Added this as a separate commit with you as coauthor, explaining the extra optimizations + updated godbolt and table from

Target & Compiler Stride (per iter) Main operation(s) in loop Effective XORs

Clang x86-64 (trunk) 32 bytes two unaligned 128-bit loads → pxor with broadcast key → stores 4 × 64-bit

GCC x86-64 (trunk) 16 bytes two scalar 64-bit xor reg-mem instructions directly on the target 2 × 64-bit

GCC RV32 (trunk) 8 bytes copy 8 B to stack scratch → XOR via two 32-bit regs → copy back 1 × 64-bit

GCC s390x (big-endian 14.2) 64 bytes eight XC ops using key cached on stack 8 × 64-bit

to:

Target & Compiler Stride (per hot-loop iter) Main operation(s) in loop Effective XORs / iter

Clang x86-64 (trunk) 64 bytes 4 × movdqu → pxor → store (aligned after peel) 8 × 64-bit

GCC x86-64 (trunk) 64 bytes same 4 × movdqu/pxor sequence, enabled by 8-way unroll 8 × 64-bit

GCC RV32 (trunk) 8 bytes copy 8 B to temp stack → 2 × 32-bit XOR → copy back 1 × 64-bit (as 2 × 32-bit)

GCC s390x (big-endian 14.2) 64 bytes 8 × XC (mem-mem 8-B XOR) with key cached on stack 8 × 64-bit

There was a build failure on 32 bit systems because of the alignment changes - fixed it by moving the std::assume_aligned check inside the alignment condition itself.
Remeasured the benchmarks (all 3 compilers are the same as the previous push) and redid the endianness check (as described in the PR) and updated the https://godbolt.org/z/35nveanf5 again.

src/obfuscation.h

hodlinator · 2025-05-22T07:40:43Z

src/obfuscation.h

+        const uint64_t rot_key{m_rotations[key_offset_bytes % SIZE_BYTES]}; // Continue obfuscation from where we left off
+        for (; target.size() >= SIZE_BYTES; target = target.subspan(SIZE_BYTES)) { // Process multiple bytes at a time
+            Xor(target, rot_key, SIZE_BYTES);
+        }


do you see an explanation in the assembly why it would be faster?

Didn't check the assembly yet, do you mind sharing your workflow for that? I was using a combination of objdump with a bunch of flags + searching in less for the consteval hex stuff ~9 months ago but suspect Obfuscation-methods get heavily inlined?

Re-measured with Clang to make sure I wasn't running the wrong build, but can still reproduce the result:

At tip of PR 989537f:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

73,504.93 13,604.53 0.1% 655,366.68 270,831.04 2.420 131,074.07 0.0% 10.66 ObfuscationBench

With diff from #31144 (comment) applied:

ns/MiB MiB/s err% ins/MiB cyc/MiB IPC bra/MiB miss% total benchmark

37,501.52 26,665.59 0.9% 294,930.14 138,093.33 2.136 32,771.94 0.0% 11.03 ObfuscationBench

xor_bytes_reference randomly generates where the next offset will be, check it by adding:

That only tests with an offset into the key, not an offset into the target, right?

I'm thinking something like:

diff --git a/src/test/streams_tests.cpp b/src/test/streams_tests.cpp index 1985b24663..2e10792710 100644 --- a/src/test/streams_tests.cpp +++ b/src/test/streams_tests.cpp @@ -61,6 +61,7 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference) for (size_t test{0}; test < 100; ++test) { const size_t write_size{1 + m_rng.randrange(100U)}; const size_t key_offset{m_rng.randrange(3 * 8U)}; // Should wrap around + const size_t write_offset{std::min(write_size, m_rng.randrange(Obfuscation::SIZE_BYTES * 2))}; std::array<std::byte, Obfuscation::SIZE_BYTES> key_bytes{}; if (m_rng.randbool()) m_rng.fillrand(key_bytes); @@ -68,8 +69,8 @@ BOOST_AUTO_TEST_CASE(xor_bytes_reference) std::vector expected{m_rng.randbytes<std::byte>(write_size)}; std::vector actual{expected}; - expected_xor(expected, key_bytes, key_offset); - obfuscation(actual, key_offset); + expected_xor(std::span{expected}.subspan(write_offset), key_bytes, key_offset); + obfuscation(std::span{actual}.subspan(write_offset), key_offset); BOOST_CHECK_EQUAL_COLLECTIONS(expected.begin(), expected.end(), actual.begin(), actual.end()); }

It's super unlikely that one would start at an unaligned offset like that, but still more complete to test it.

src/bench/xor.cpp

Since production code only uses keys of length 8, we're not testing with other values anymore Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>

Since a previous PR already solved the tiny byte-array xors during serialization, we're only concentrating on big continuous chunks now. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 879,957.05 | 1,136.42 | 0.0% | 9,437,186.42 | 3,158,477.41 | 2.988 | 1,048,576.99 | 0.0% | 5.50 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 926,379.31 | 1,079.47 | 0.1% | 6,815,747.36 | 3,325,871.40 | 2.049 | 524,289.23 | 0.0% | 5.50 | `ObfuscationBench`

This simplifies the next commits.

Since `CDBWrapper::Read` still supports vector only, we can initialize `m_obfuscation` directly instead of using a separate helper. `CreateObfuscation` was also inlined, replaced `key_exists` with `key_missing`, and simplified the `if` condition that writes a new obfuscation key. Since we're already throwing via `HandleError` here, the obfuscation key is also validated by throwing if invalid.

DrahtBot · 2025-05-23T14:15:42Z

🚧 At least one of the CI tasks failed.
_{Task MSan, depends: https://github.com/bitcoin/bitcoin/runs/42787791556}
_{LLM reason (✨ experimental): The CI failure is due to a failed assertion related to memory alignment during streams_tests.}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

…yte>` to `uint64_t` Since `util::Xor` now operates on `uint64_t` keys, migrated the obfuscation key end‑to‑end - from use sites in util::Xor up through streams, LevelDB wrappers, and disk (de)serialization - replacing all former `std::vector<std::byte>` keys with `uint64_t` (we still serialize them as vectors but convert immediately to `uint64_t` on load). This is why tests still generate vector keys and convert them to `uint64_t` later instead of generating them directly. We also short‑circuit `Xor` calls when the key is zero to avoid unnecessary calculations (e.g., `MakeWritableByteSpan`). In `Obfuscation::Unserialize` we can safely throw an `std::logic_error` since during mempool fuzzing `mempool_persist.cpp#L141` catches and ignored these errors safely. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 51,665.69 | 19,355.21 | 0.0% | 327,684.05 | 185,409.22 | 1.767 | 65,536.55 | 0.0% | 5.50 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 74,817.84 | 13,365.80 | 0.0% | 655,366.68 | 268,566.88 | 2.440 | 131,074.08 | 0.0% | 5.50 | `ObfuscationBench` Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>

Benchmarks indicated that processing 8-byte words instead of bytes already gives an order of magnitude speed-up, but: * GCC still emitted scalar code; * Clang’s auto-vectorised loop ran on the slow unaligned-load path. Fix contains: * peeling the mis-aligned head enabled the hot loop starting at an 8-byte address; * `std::assume_aligned<8>` tells the optimiser the promise holds - required to keep Apple Clang happy; * manually unrolling the body to 64 bytes enabled GCC to auto-vectorise. > C++ compiler .......................... GNU 14.2.0 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 37,281.01 | 26,823.31 | 0.1% | 475,139.64 | 133,808.24 | 3.551 | 81,920.54 | 0.0% | 5.35 | `ObfuscationBench` > C++ compiler .......................... Clang 20.1.6 | ns/MiB | MiB/s | err% | ins/MiB | cyc/MiB | IPC | bra/MiB | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 34,770.15 | 28,760.30 | 0.0% | 262,149.44 | 124,756.60 | 2.101 | 16,384.83 | 0.0% | 5.32 | `ObfuscationBench` Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>

hodlinator

re-ACK 63854e8

While the code is made less straightforward due to optimizations, I think it's still worth doing it to minimize overhead from obfuscation.

Changes since first ACK (#31144 (review)):

Changed commit messages to include benchmark results from more recent GCC and newer vanilla instead of Apple-flavoured Clang.
Modified unit test to add verification of obfuscation starting at non-aligned offsets as suggested.
Renamed benchmark file as suggested.
Removed ReadBlockBench and ReadRawBlockBench results from commit message (verified to still improve at the end of this review).
Added new commit at the end with further optimization partially inspired by my suggestion ([IBD] multi-byte block obfuscation #31144 (comment))!
- Adapts code to help compiler with loop unrolling and SIMD.
- Instead of having *reinterpret_cast<uint64_t*>(target.data()) ^= rot_key, the unrolling allows for using Obfuscation::Xor() with memcpys back & forth and still have compilers optimize effectively.
- I like how this version calculates the number of initial non-aligned bytes.
- std::assume_aligned - nice find!

Testing

Guix + Windows

Was curious about Windows performance, so I patched the Guix build to include bench_bitcoin.exe.

Patch

diff --git a/contrib/guix/libexec/build.sh b/contrib/guix/libexec/build.sh
index ae98eba744..25191f7e24 100755
--- a/contrib/guix/libexec/build.sh
+++ b/contrib/guix/libexec/build.sh
@@ -206,7 +206,7 @@ mkdir -p "$OUTDIR"
 ###########################
 
 # CONFIGFLAGS
-CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=OFF -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
+CONFIGFLAGS="-DREDUCE_EXPORTS=ON -DBUILD_BENCH=ON -DBUILD_GUI_TESTS=OFF -DBUILD_FUZZ_BINARY=OFF"
 
 # CFLAGS
 HOST_CFLAGS="-O2 -g"
diff --git a/src/bench/CMakeLists.txt b/src/bench/CMakeLists.txt
index 3003115068..56f78ab21d 100644
--- a/src/bench/CMakeLists.txt
+++ b/src/bench/CMakeLists.txt
@@ -58,6 +58,8 @@ target_raw_data_sources(bench_bitcoin NAMESPACE benchmark::data
   data/block413567.raw
 )
 
+add_windows_application_manifest(bench_bitcoin)
+
 target_link_libraries(bench_bitcoin
   core_interface
   test_util

Result on Windows Guix build show we are in the same ballpark as results in commit message (63854e8):

> bench_bitcoin.exe -filter=ObfuscationBench -min-time=10000

|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           37,787.80 |           26,463.57 |    0.7% |     11.22 | `ObfuscationBench`

Result on Linux Guix build is similar as Windows, also those in commit messages and measurements with my suggested optimization:

₿ ./bench_bitcoin -filter=ObfuscationBench -min-time=10000

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           36,053.83 |           27,736.30 |    0.4% |      475,140.24 |      132,841.83 |  3.577 |      81,920.74 |    0.0% |     10.96 | `ObfuscationBench`

Prior to running doas perf system tune, it actually ran significantly faster (probably due to some unreliable CPU boosting logic):

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|           29,425.58 |           33,984.04 |    0.8% |      475,140.23 |      107,611.98 |  4.415 |      81,920.73 |    0.0% |     11.04 | `ObfuscationBench`

ReadBlockBench + ReadRawBlockBench

Verified that related benchmarks still improved (non-Guix, NixOS, compiler Clang 20.1.3):

Before (2df824f, PR base):

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        5,440,481.19 |              183.81 |    0.1% |   61,834,466.65 |   19,827,449.24 |  3.119 |   3,028,151.40 |    0.4% |     10.98 | `ReadBlockBench`
|          919,632.97 |            1,087.39 |    0.0% |    6,511,624.89 |    3,187,581.46 |  2.043 |     502,474.87 |    0.0% |     11.00 | `ReadRawBlockBench`

After (63854e8, PR tip):

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        4,649,147.57 |              215.09 |    0.1% |   55,576,004.70 |   16,919,942.23 |  3.285 |   2,542,138.60 |    0.5% |     11.03 | `ReadBlockBench`
|           87,733.01 |           11,398.22 |    0.2% |      262,263.03 |      116,898.39 |  2.244 |      18,154.03 |    0.1% |     10.80 | `ReadRawBlockBench`

yuvicc · 2025-06-09T08:45:50Z

Concept ACK

Will benchmark this soon!

yuvicc

Code-Review & Tested ACK 63854e8

The code lgtm, thanks to @hodlinator reviews.

Regarding Tests

x86_64 ubuntu 24.04.2

C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          107,903.53 |            9,267.54 |    0.1% |      475,140.31 |      452,373.92 |  1.050 |      81,920.81 |    0.0% |     10.99 | `ObfuscationBench`


C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          107,881.57 |            9,269.42 |    0.1% |      475,140.31 |      452,243.73 |  1.051 |      81,920.81 |    0.0% |     11.00 | `ObfuscationBench`

MacOS Sequoia 15.3.2

C++ compiler ................................................................... Clang 16.0.0
|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           22,649.25 |           44,151.57 |    4.7% |     10.80 | `ObfuscationBench`


C++ compiler ................................................................... Clang 18.1.7
|              ns/MiB |               MiB/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           21,154.05 |           47,272.27 |    0.1% |     10.99 | `ObfuscationBench`

commit 3959a90

C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          117,618.83 |            8,502.04 |    0.1% |      655,362.22 |      494,070.94 |  1.326 |     131,072.42 |    0.0% |      5.51 | `ObfuscationBench`

C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          113,280.61 |            8,827.64 |    0.1% |      327,683.32 |      475,761.88 |  0.689 |      65,536.51 |    0.0% |      5.50 | `ObfuscationBench`

Commit be3435c

C++  compiler ................................................................... Clang 18.1.3

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          788,559.17 |            1,268.14 |    0.1% |    6,815,747.22 |    3,366,204.35 |  2.025 |     524,289.09 |    0.0% |      5.50 | `ObfuscationBench`

C++ compiler ................................................................... GNU 13.3.0

|              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          604,312.84 |            1,654.77 |    0.1% |    9,437,186.04 |    2,579,119.63 |  3.659 |   1,048,576.71 |    0.0% |      5.51 | `ObfuscationBench`

I will be doing IBD test on my machine and share the results soon!

maflcko · 2025-06-16T10:17:31Z

For other benchmark speedups see https://corecheck.dev/bitcoin/bitcoin/pulls/31144

I don't think this list is complete. At least for me it is truncated, due to the benchmark rename in the early commit. I tried to fix it in corecheck/corecheck#90, but I am not sure how to re-run it here. Maybe rebase?

l0rinc force-pushed the l0rinc/optimize-xor branch from b76314a to 10b9e68 Compare October 24, 2024 07:59

DrahtBot added the CI failed label Oct 24, 2024

maflcko reviewed Oct 24, 2024

View reviewed changes

src/streams.h Outdated Show resolved Hide resolved

src/streams.h Outdated Show resolved Hide resolved

l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 23f4809 to 6ae466b Compare October 24, 2024 09:59

laanwj reviewed Oct 24, 2024

View reviewed changes

src/streams.h Outdated Show resolved Hide resolved

laanwj reviewed Oct 24, 2024

View reviewed changes

src/streams.h Outdated Show resolved Hide resolved

DrahtBot removed the CI failed label Oct 24, 2024

l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 353915b to a3dc138 Compare October 24, 2024 12:20

l0rinc changed the title ~~optimization: pack util::Xor into 64 bit chunks instead of doing it byte-by-byte~~ optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte Oct 24, 2024

l0rinc marked this pull request as ready for review October 24, 2024 14:22

laanwj added the Block storage label Oct 24, 2024

hodlinator reviewed Oct 25, 2024

View reviewed changes

src/test/streams_tests.cpp Show resolved Hide resolved

src/test/streams_tests.cpp Outdated Show resolved Hide resolved

src/test/streams_tests.cpp Outdated Show resolved Hide resolved

l0rinc marked this pull request as draft October 25, 2024 18:47

l0rinc force-pushed the l0rinc/optimize-xor branch 3 times, most recently from 95e38ac to cd61e8b Compare October 26, 2024 21:58

DrahtBot added the CI failed label Oct 26, 2024

l0rinc force-pushed the l0rinc/optimize-xor branch from cd61e8b to d72968f Compare October 26, 2024 22:17

l0rinc changed the title ~~optimization: pack util::Xor into 64/32 bit chunks instead of doing it byte-by-byte~~ optimization: change XOR obfuscation key from std::vector<std::byte>(8) to uint64_t Oct 26, 2024

DrahtBot mentioned this pull request Oct 27, 2024

scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] #29641

Draft

l0rinc force-pushed the l0rinc/optimize-xor branch from d72968f to 555f8c8 Compare October 27, 2024 15:17

DrahtBot removed the CI failed label Oct 27, 2024

l0rinc requested review from hodlinator and sipa May 1, 2025 11:47

hodlinator reviewed May 8, 2025

View reviewed changes

l0rinc mentioned this pull request May 8, 2025

contrib: add xor-blocks tool to obfuscate blocks directory #32451

Closed

l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 8a918bb to 9406c17 Compare May 9, 2025 15:06

hodlinator reviewed May 19, 2025

View reviewed changes

l0rinc force-pushed the l0rinc/optimize-xor branch 2 times, most recently from 2fdc2a7 to 989537f Compare May 19, 2025 15:00

hodlinator approved these changes May 20, 2025

View reviewed changes

hodlinator reviewed May 22, 2025

View reviewed changes

l0rinc and others added 5 commits May 23, 2025 15:48

test: compare util::Xor with randomized inputs against simple impl

f6689a7

Since production code only uses keys of length 8, we're not testing with other values anymore Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>

refactor: unify xor-vs-obfuscation nomenclature

2c8cb56

This simplifies the next commits.

refactor: prepare mempool_persist for obfuscation key change

f79e8b9

l0rinc force-pushed the l0rinc/optimize-xor branch from 989537f to 5627cd4 Compare May 23, 2025 13:49

DrahtBot added the CI failed label May 23, 2025

l0rinc and others added 2 commits May 23, 2025 17:00

l0rinc force-pushed the l0rinc/optimize-xor branch from 5627cd4 to 63854e8 Compare May 23, 2025 15:15

DrahtBot removed the CI failed label May 23, 2025

hodlinator approved these changes May 28, 2025

View reviewed changes

hodlinator mentioned this pull request May 28, 2025

build: Add resource file and manifest to bitcoin.exe #32634

Merged

yuvicc reviewed Jun 11, 2025

View reviewed changes

		} else if (version == MEMPOOL_DUMP_VERSION) {
		std::vector<std::byte> obfuscation{8};

ns/MiB	MiB/s	err%	total	benchmark
14,543.90	68,757.33	0.1%	11.01	`ObfuscationBench`
14,604.52	68,471.95	0.0%	11.01	`ObfuscationBench`
14,573.71	68,616.72	0.2%	11.04	`ObfuscationBench`

ns/MiB	MiB/s	err%	total	benchmark
19,552.98	51,143.11	0.1%	11.01	`ObfuscationBench`
19,529.72	51,204.01	0.2%	11.00	`ObfuscationBench`
20,046.53	49,883.94	0.1%	11.03	`ObfuscationBench`

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
51,601.83	19,379.16	0.0%	327,687.85	185,151.71	1.770	65,536.95	0.0%	5.50	`ObfuscationBench`
51,952.63	19,248.30	0.1%	327,687.85	186,403.90	1.758	65,536.95	0.0%	5.50	`ObfuscationBench`

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
46,341.13	21,579.10	0.0%	475,139.45	166,272.12	2.858	81,920.35	0.0%	5.50	`ObfuscationBench`
45,936.99	21,768.95	0.0%	475,139.45	164,853.63	2.882	81,920.35	0.0%	5.50	`ObfuscationBench`

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
38,396.57	26,043.99	0.1%	294,929.64	137,732.80	2.141	32,771.94	0.0%	5.35	`ObfuscationBench`
38,151.47	26,211.31	0.2%	294,929.64	136,877.12	2.155	32,771.94	0.0%	5.38	`ObfuscationBench`

ns/MiB	MiB/s	err%	ins/MiB	cyc/MiB	IPC	bra/MiB	miss%	total	benchmark
35,556.53	28,124.23	0.1%	262,148.44	127,610.00	2.054	16,384.64	0.0%	5.32	`ObfuscationBench`
36,409.31	27,465.50	0.1%	262,148.44	130,650.29	2.006	16,384.64	0.0%	5.33	`ObfuscationBench`

Target & Compiler	Stride (per iter)	Main operation(s) in loop	Effective XORs
Clang x86-64 (trunk)	32 bytes	two unaligned 128-bit loads → pxor with broadcast key → stores	4 × 64-bit
GCC x86-64 (trunk)	16 bytes	two scalar 64-bit xor reg-mem instructions directly on the target	2 × 64-bit
GCC RV32 (trunk)	8 bytes	copy 8 B to stack scratch → XOR via two 32-bit regs → copy back	1 × 64-bit
GCC s390x (big-endian 14.2)	64 bytes	eight XC ops using key cached on stack	8 × 64-bit

Target & Compiler	Stride (per hot-loop iter)	Main operation(s) in loop	Effective XORs / iter
Clang x86-64 (trunk)	64 bytes	4 × movdqu → pxor → store (aligned after peel)	8 × 64-bit
GCC x86-64 (trunk)	64 bytes	same 4 × movdqu/pxor sequence, enabled by 8-way unroll	8 × 64-bit
GCC RV32 (trunk)	8 bytes	copy 8 B to temp stack → 2 × 32-bit XOR → copy back	1 × 64-bit (as 2 × 32-bit)
GCC s390x (big-endian 14.2)	64 bytes	8 × XC (mem-mem 8-B XOR) with key cached on stack	8 × 64-bit

[IBD] multi-byte block obfuscation #31144

Are you sure you want to change the base?

[IBD] multi-byte block obfuscation #31144

Conversation

Uh oh!

Summary

Changes in testing, benchmarking and implementation

Assembly

Endianness

Measurements (micro benchmarks and full IBDs)

Uh oh!

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Benchmarks

GCC 14.2.1

Clang Clang 20.1.3

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment