Add ASM optimizations for MuHash3072 #19181

fjahr · 2020-06-05T20:22:29Z

Adds assembly optimizations for MuHash3072 which is used for the muhash calculation of the UTXO set hash.

src/crypto/muhash.cpp

DrahtBot · 2020-06-07T23:49:29Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	Sjors

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

Sjors · 2020-06-09T10:48:20Z

I'll post some bench results here after #19214 and when the SHA512 -> 256 switch is implemented.

laanwj · 2020-06-09T14:29:36Z

Awesome, thanks for working on this. Given how good compilers are nowadays it's somewhat surprising to me that there's still something to gain with implementing things manually in assembly without use of any special instruction sets.

Sjors · 2021-12-15T12:18:01Z

Concept ACK. Description needs an update. And this still needs a rebase.

fjahr · 2021-12-19T17:52:41Z

Thanks for the nudge @Sjors and @DrahtBot !

I have only done a plain rebase so far, i.e. I have not checked whether further optimizations may be interesting to look into. But when I ran the benchmarks again, I noticed that this PR currently appears to be slower than master on my machine: mul and div operations are almost 20% slower. I will try to test on some other machines soon, but if someone else can give it a spin as well, it would be a great help!

Results with this PR

$ src/bench/bench_bitcoin -filter=MuHash.* -min_time=1000

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            7,184.06 |          139,197.00 |    0.8% |      1.07 | `MuHash`
|            6,138.95 |          162,894.31 |    0.1% |      1.05 | `MuHashDiv`
|            6,161.80 |          162,290.16 |    0.7% |      1.05 | `MuHashMul`
|            1,016.96 |          983,326.67 |    0.3% |      1.08 | `MuHashPrecompute`

Results on master

$ src/bench/bench_bitcoin -filter=MuHash.* -min_time=1000

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            6,225.79 |          160,622.20 |    1.6% |      1.05 | `MuHash`
|            5,121.86 |          195,241.46 |    0.5% |      1.09 | `MuHashDiv`
|            5,173.32 |          193,299.36 |    1.5% |      1.11 | `MuHashMul`
|            1,023.41 |          977,127.42 |    0.7% |      1.08 | `MuHashPrecompute`

maflcko · 2021-12-20T17:00:42Z

With gcc 11.2.0 on Cortex-A72.

This:

src/bench/bench_bitcoin -filter=MuHash.* 
Warning, results might be unstable:
* CPU frequency scaling enabled: CPU 0 between 600.0 and 1,500.0 MHz
* CPU governor is 'ondemand' but should be 'performance'
* Turbo is enabled, CPU frequency will fluctuate

Recommendations
* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           27,412.56 |           36,479.63 |    0.1% |      0.01 | `MuHash`
|           24,505.61 |           40,806.99 |    0.0% |      0.01 | `MuHashDiv`
|           24,513.39 |           40,794.02 |    0.0% |      0.01 | `MuHashMul`
|            2,874.73 |          347,859.09 |    0.0% |      0.01 | `MuHashPrecompute`

Master:

src/bench/bench_bitcoin -filter=MuHash.* 
Warning, results might be unstable:
* CPU frequency scaling enabled: CPU 0 between 600.0 and 1,500.0 MHz
* CPU governor is 'ondemand' but should be 'performance'
* Turbo is enabled, CPU frequency will fluctuate

Recommendations
* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|           27,451.77 |           36,427.52 |    0.0% |      0.01 | `MuHash`
|           24,524.13 |           40,776.16 |    0.1% |      0.01 | `MuHashDiv`
|           24,494.58 |           40,825.36 |    0.0% |      0.01 | `MuHashMul`
|            2,853.94 |          350,392.61 |    0.0% |      0.01 | `MuHashPrecompute`

sipa · 2021-12-20T17:04:26Z

@MarcoFalke Well, that's expected at least. There are only x86_64 asm implementations here.

@fjahr What kind of system/compiler/flags?

sipa · 2021-12-20T17:11:25Z

Benchmark on Ryzen 5950X, Ubuntu 21.10, GCC 11.2.0, default configure flags.

master:

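|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            4,133.94 |          241,900.24 |    0.5% |      0.01 | `MuHash`
|            3,530.11 |          283,277.63 |    0.4% |      0.01 | `MuHashDiv`
|            3,522.44 |          283,894.43 |    0.2% |      0.01 | `MuHashMul`
|              554.91 |        1,802,107.77 |    0.1% |      0.01 | `MuHashPrecompute`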

This PR:

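|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            3,157.98 |          316,658.14 |    0.3% |      0.01 | `MuHash`
|            2,597.48 |          384,987.98 |    0.1% |      0.01 | `MuHashDiv`
|            2,597.10 |          385,044.86 |    0.2% |      0.01 | `MuHashMul`
|              556.71 |        1,796,271.37 |    0.0% |      0.01 | `MuHashPrecompute`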

maflcko · 2021-12-21T09:21:17Z

@MarcoFalke Well, that's expected at least. There are only x86_64 asm implementations here.

I wish I could read asm. 😅

fjahr · 2021-12-21T21:43:02Z

@fjahr What kind of system/compiler/flags?

Hmm, this was with Intel Core i5-6287U, darwin 21.2.0 (macOS 12.1), clang 13.0.0, default configure flags except for skipping the GUI.

sipa · 2021-12-21T22:13:16Z

@fjahr Did you disable frequency scaling? Mobile CPUs by default will change frequency all the time in response to load, leading to unreliable benchmarks.

theStack

Results using gcc 11.3.0 and default configuration on an AMD EPYC 7702P (running Ubuntu 22.04):

master:

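|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            6,810.64 |          146,829.16 |    2.5% |      0.01 | `MuHash`
|            5,947.59 |          168,135.40 |    1.0% |      0.01 | `MuHashDiv`
|            5,756.02 |          173,731.09 |    0.8% |      0.01 | `MuHashMul`
|              838.00 |        1,193,321.69 |    0.4% |      0.01 | `MuHashPrecompute`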

PR:

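|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            5,457.87 |          183,221.54 |    0.4% |      0.01 | `MuHash`
|            4,605.19 |          217,146.38 |    0.9% |      0.01 | `MuHashDiv`
|            4,615.33 |          216,669.07 |    0.7% |      0.01 | `MuHashMul`
|              848.51 |        1,178,540.85 |    0.7% |      0.01 | `MuHashPrecompute`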

For a more practical test, I ran gettxoutsetinfo muhash without coinstatsindex on mainnet at height 765184 on both master and the PR (started with -nolisten -noconnect to avoid distractions for the benchmark), showing a nice ~18% speedup:

master:

$ time ./src/bitcoin-cli gettxoutsetinfo muhash
{
  "height": 765184,
  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
  "txouts": 83195150,
  "bogosize": 6203273828,
  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
  "total_amount": 19219692.16624067,
  "transactions": 49805302,
  "disk_size": 5992277808
}

real    6m28.066s
user    0m0.003s
sys     0m0.001s

PR:

$ time ./src/bitcoin-cli gettxoutsetinfo muhash
{
  "height": 765184,
  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
  "txouts": 83195150,
  "bogosize": 6203273828,
  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
  "total_amount": 19219692.16624067,
  "transactions": 49805302,
  "disk_size": 5991699018
}

real    5m28.506s
user    0m0.003s
sys     0m0.000s

I seem to get consistently better results than master on GCC but also consistently worse results on clang.

Interesting, I will also repeat the above test runs using clang in a bit to see if I can observe the same effect.

theStack

I seem to get consistently better results than master on GCC but also consistently worse results on clang.

I can confirm that the MuHash performance with this PR is worse than on master when compiling using clang 14.0.0 (default configuration on an AMD EPYC 7702P, running Ubuntu 22.04). Benchmark results:

master:

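|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            5,476.23 |          182,607.35 |    1.8% |      0.01 | `MuHash`
|            4,570.74 |          218,782.88 |    1.0% |      0.01 | `MuHashDiv`
|            4,547.88 |          219,882.44 |    0.9% |      0.01 | `MuHashMul`
|              814.77 |        1,227,343.46 |    0.8% |      0.01 | `MuHashPrecompute`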

PR:

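|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            5,799.24 |          172,436.30 |    0.6% |      0.01 | `MuHash`
|            5,022.26 |          199,113.39 |    1.3% |      0.01 | `MuHashDiv`
|            4,954.25 |          201,847.07 |    0.3% |      0.01 | `MuHashMul`
|              828.04 |        1,207,671.75 |    0.8% |      0.01 | `MuHashPrecompute`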

And the gettxoutsetinfo muhash tests for mainnet block 765184:

master:

$ time ./src/bitcoin-cli gettxoutsetinfo muhash
{
  "height": 765184,
  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
  "txouts": 83195150,
  "bogosize": 6203273828,
  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
  "total_amount": 19219692.16624067,
  "transactions": 49805302,
  "disk_size": 5991686963
}

real    5m36.934s
user    0m0.002s
sys     0m0.002s

PR:

$ time ./src/bitcoin-cli gettxoutsetinfo muhash
{
  "height": 765184,
  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
  "txouts": 83195150,
  "bogosize": 6203273828,
  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
  "total_amount": 19219692.16624067,
  "transactions": 49805302,
  "disk_size": 5991686963
}

real    5m52.723s
user    0m0.004s
sys     0m0.000s

(Sorry for the long double-posts, I should have summarized both of them in a small summary table.)

Based on those results, if at all, those optimizations should probably only be enabled if GCC is used? (Not sure if it's worth the review and maintenance burden though.)

achow101 · 2023-04-25T15:37:15Z

@real-or-random

real-or-random · 2023-04-25T16:13:33Z

Based on those results, if at all, those optimizations should probably only be enabled if GCC is used? (Not sure if it's worth the review and maintenance burden though.)

Some of the code changes here conflict with #21590, and given that clang 14 is better than this ASM, we should maybe get benchmarks on a recent GCC (like 12.2) and see if it's faster. If not, I also wonder if there's a better way to convince GCC to output good code.

@theStack Can you re-benchmark with GCC 12? I can also try to get some numbers on my machine.

maflcko · 2023-04-26T14:19:13Z

Maybe even with gcc 13.1, now that it is about to be released?

fjahr · 2023-05-01T19:29:18Z

I finally got another amd64 machine with which I can test again. I confirmed the clang-14 results others have seen and I tested with GCC 13.1. The results still look good there for me, but it would be great if someone else could confirm.

(using src/bench/bench_bitcoin -filter=MuHash.* -min-time=1000)

Master:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            9,000.29 |          111,107.54 |    1.6% |      1.09 | `MuHash`
|            7,424.35 |          134,691.90 |    0.0% |      1.05 | `MuHashDiv`
|            7,428.86 |          134,610.13 |    0.2% |      1.06 | `MuHashMul`
|            1,058.97 |          944,314.00 |    0.0% |      1.08 | `MuHashPrecompute`

This PR:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            6,936.64 |          144,161.92 |    0.1% |      1.08 | `MuHash`
|            5,596.30 |          178,689.37 |    0.2% |      1.03 | `MuHashDiv`
|            5,422.52 |          184,415.94 |    0.7% |      1.08 | `MuHashMul`
|            1,061.63 |          941,950.39 |    0.0% |      1.09 | `MuHashPrecompute`

I guess it makes sense to only use this when GCC is used, which should work with this guard:

#if (defined(__amd64__) || defined(__x86_64__)) && defined(__GNUC__) && !defined(__clang__)

but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

I will also try to combine the changes here with #21590 to see what the combined impact is.
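A minimal sketch of how a guard like the one above could gate an assembly path while keeping a portable fallback. The macro name `MUHASH_SKETCH_USE_ASM` and the helper `AddWithCarryOut` are hypothetical and not taken from this PR; the asm body only illustrates a single add/adc pair, not the actual optimized routines.

```cpp
#include <cstdint>

// Hypothetical macro: true only for GCC (not clang) targeting x86_64.
#if (defined(__amd64__) || defined(__x86_64__)) && defined(__GNUC__) && !defined(__clang__)
#define MUHASH_SKETCH_USE_ASM 1
#else
#define MUHASH_SKETCH_USE_ASM 0
#endif

// Adds b to a in place and returns the carry-out (0 or 1).
inline uint64_t AddWithCarryOut(uint64_t& a, uint64_t b)
{
#if MUHASH_SKETCH_USE_ASM
    uint64_t carry = 0;
    __asm__("addq %2, %0\n\t"  // a += b, sets the carry flag on overflow
            "adcq $0, %1"      // carry += carry flag
            : "+r"(a), "+r"(carry)
            : "r"(b)
            : "cc");
    return carry;
#else
    // Portable fallback: an unsigned sum wrapped around iff it is smaller than an addend.
    a += b;
    return a < b ? 1 : 0;
#endif
}
```

Both branches are meant to compute the same result, so a build with clang or on a non-x86_64 target would simply take the fallback.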

real-or-random · 2023-05-02T08:21:08Z

#if (defined(__amd64__) || defined(__x86_64__)) && defined(__GNUC__) && !defined(__clang__)

If it's that niche, it's a bit unclear to me whether it's worth the hassle. I feel we should look at #21590 first. I expect this to be a much larger improvement (and since it's algorithmic, it will apply to all targets). Perhaps we don't care about this optimization here so much after #21590.

fjahr · 2023-05-02T14:10:55Z

I have now tested SafeGCD vs SafeGCD+ASM (see https://github.com/fjahr/bitcoin/tree/pr21590-safegcd-asm) and the gains from including the ASM code are still substantial.

GCC 13.1 - SafeGCD only:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            9,168.73 |          109,066.35 |    2.1% |      1.06 | `MuHash`
|            7,571.75 |          132,069.88 |    0.2% |      1.05 | `MuHashDiv`
|           75,079.98 |           13,319.13 |    0.0% |      1.06 | `MuHashFinalize`
|            7,322.87 |          136,558.39 |    0.1% |      1.05 | `MuHashMul`
|            1,052.74 |          949,904.92 |    0.0% |      1.09 | `MuHashPrecompute`

GCC 13.1 - SafeGCD + ASM:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|            6,924.98 |          144,404.76 |    0.2% |      1.08 | `MuHash`
|            5,631.55 |          177,571.08 |    0.1% |      1.09 | `MuHashDiv`
|           70,266.78 |           14,231.48 |    0.8% |      1.05 | `MuHashFinalize`
|            5,454.11 |          183,347.95 |    0.1% |      1.09 | `MuHashMul`
|            1,051.74 |          950,801.50 |    0.0% |      1.09 | `MuHashPrecompute`

@real-or-random while this change here is niche in terms of the tooling+architecture it targets, the SafeGCD change only impacts the finalize operation and that has a much smaller impact on the overall computation time because of how we use MuHash in practice. When a new block comes in we do a div of all the spent TXOs and a mul of all the new UTXOs and then at the end we finalize once. So I still have a hard time deciding which one is the more valuable change. Also, while ASM might be a bit intimidating, #21590 has 10x the LOC changed and requires some understanding of the underlying theory.
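The usage pattern described above (many mul/div operations, one finalize) can be sketched with a toy accumulator. This is an illustration under loud assumptions: a small prime `P` stands in for the 3072-bit modulus, elements are assumed to be already mapped into [1, P-1], and the names `ToyMuHash`, `MulMod` and `InvMod` are hypothetical. The inversion uses Fermat's little theorem purely for brevity.

```cpp
#include <cstdint>

// Toy modulus (assumption): a small prime, so products fit in 64 bits.
constexpr uint64_t P = 1000003;

inline uint64_t MulMod(uint64_t a, uint64_t b) { return (a * b) % P; }

// Modular inverse as a^(P-2) mod P; only valid for a not divisible by P.
inline uint64_t InvMod(uint64_t a)
{
    uint64_t result = 1;
    uint64_t e = P - 2;
    while (e > 0) {
        if (e & 1) result = MulMod(result, a);
        a = MulMod(a, a);
        e >>= 1;
    }
    return result;
}

struct ToyMuHash {
    uint64_t numerator = 1;
    uint64_t denominator = 1;

    // "mul": one modular multiplication per added element.
    void Insert(uint64_t elem) { numerator = MulMod(numerator, elem); }
    // "div": also one modular multiplication, into the denominator.
    void Remove(uint64_t elem) { denominator = MulMod(denominator, elem); }
    // The single expensive inversion happens only here.
    uint64_t Finalize() const { return MulMod(numerator, InvMod(denominator)); }
};
```

In this sketch, inserting 11, 22, 33 and then removing 22 finalizes to the same value as inserting only 33 and 11, regardless of order. Speeding up the multiplication affects every `Insert` and `Remove`, while a faster inversion only affects `Finalize`.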

real-or-random · 2023-05-02T14:17:23Z

the SafeGCD change only impacts the finalize operation and that has a much smaller impact on the overall computation time because of how we use MuHash in practice. When a new block comes in we do a div of all the spent TXOs and a mul of all the new UTXOs and then at the end we finalize once.

Okay, thanks for the numbers. I agree then, we shouldn't neglect this one.

So I still have a hard time deciding which one is the more valuable change. Also, while ASM might be a bit intimidating, #21590 has 10x the LOC changed and requires some understanding of the underlying theory.

Yeah, they both have their merits then, and I don't think any should be prioritized over the other. Let's do (first) whatever PR we get the ACKs on. (And yes, I expect #21590 to be harder to review...)

but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

That would be ideal. I'll try to have a look at this.

real-or-random · 2023-05-02T16:05:51Z

but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

I played around with this a bit, and I don't see any obvious trick to make that work. If someone else wants to give it a try, https://gcc.godbolt.org/z/hhGfeEoKq could be a nice starting point.

DrahtBot · 2023-08-08T09:16:27Z

There hasn't been much activity lately. What is the status here?

Finding reviewers may take time. However, if the patch is no longer relevant, please close this pull request. If the author lost interest or time to work on this, please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

fjahr · 2023-08-08T15:02:11Z

There hasn't been much activity lately. What is the status here?

Finding reviewers may take time. However, if the patch is no longer relevant, please close this pull request. If the author lost interest or time to work on this, please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

Still relevant... How good is your ASM, @DrahtBot ? 😁

sipa · 2023-09-29T14:31:28Z

It's a rather unsatisfying situation that a compiler produces better code than hand-written assembly. One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

real-or-random · 2023-09-29T14:44:46Z

It's a rather unsatisfying situation that a compiler produces better code than hand-written assembly. One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

Yes, but what would be a proper way of reviewing such a PR? Just comparing with the clang output? If we think that's sufficient, then that's a possible way forward.

real-or-random · 2023-11-08T11:30:13Z

There's a lot of activity recently in GCC towards generating adc instructions on x86(_64): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 But the current GCC trunk (14) so far hasn't improved over GCC 13 for our code. The GCC thread also has some hints at possible other implementations such as GCC builtins, which may be simpler to review.

I don't know. Checking the asm annotations is not trivial and not many people are familiar with inline asm. But on the other hand, I'm not saying that this PR is a huge review effort. It's just a handful of small functions, and someone needs to take the time.
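For readers unfamiliar with the builtin-based approach, here is a minimal sketch of a multi-limb addition written with `__builtin_add_overflow`, which exists in both GCC and clang. It is one generic option and not necessarily the variant discussed in the GCC thread; the function name `AddLimbs` is hypothetical, and whether a given compiler turns the loop into an add/adc chain depends on the compiler and version.

```cpp
#include <cstddef>
#include <cstdint>

// Adds the n-limb little-endian integer b into a and returns the final carry.
inline uint64_t AddLimbs(uint64_t* a, const uint64_t* b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t sum;
        // First add the two limbs, then add the incoming carry.
        bool c1 = __builtin_add_overflow(a[i], b[i], &sum);
        bool c2 = __builtin_add_overflow(sum, carry, &a[i]);
        // At most one of the two additions can overflow, so OR is enough.
        carry = c1 | c2;
    }
    return carry;
}
```

Compared to inline asm, there are no operand constraints or clobber lists to check here, which is the review advantage hinted at above.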

achow101 · 2024-04-09T14:53:22Z

One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

Seems like we should do that.

achow101 · 2024-04-09T14:54:54Z

Closing as up for grabs due to lack of activity.

laanwj · 2024-04-09T15:02:28Z

One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

In general I really dislike the idea of copy/pasting assembly output from a compiler into the source code. It's already hard enough to review human-generated asm code but at least you can ask the author about the reasoning how and why. In the case of compiler output, well, it'd be a matter of waiting for gcc to improve 🙂

maflcko · 2024-04-09T15:06:27Z

Removing "up for grabs" for now. I don't think anyone will review asm, regardless of where it came from? If there are reviewers who would review it, they should speak up first, no?

laanwj · 2024-04-09T15:18:48Z

To be clear, I'm happy to review asm if there's 1) a very clear performance win in an important part of the code 2) it's human-written and well commented 3) it's only small and relatively straightforward, self-contained operations.
With how good compilers are nowadays it should be rare, though. With new instruction sets it's generally better (also for review, since it's easier to check data flow) to use intrinsics instead of direct inline assembly.

This was referenced Jun 5, 2020

Add MuHash3072 implementation #19055

Merged

[WIP] Index for UTXO Set Statistics #18000

Closed

ysangkok reviewed Jun 5, 2020

DrahtBot added Build system Tests Utils/log/libs labels Jun 5, 2020

fjahr force-pushed the csi-4-muhash-asm branch from 722700a to abaf8c8 Compare June 7, 2020 22:03

DrahtBot mentioned this pull request Jun 8, 2020

Add hash_type MUHASH for gettxoutsetinfo #19145

Merged

fjahr force-pushed the csi-4-muhash-asm branch from abaf8c8 to 8ca82da Compare June 11, 2020 22:05

DrahtBot mentioned this pull request Jun 13, 2020

Replace current benchmarking framework with nanobench #18011

Merged

DrahtBot added the Needs rebase label Jul 30, 2020

fjahr force-pushed the csi-4-muhash-asm branch from 8ca82da to 915ef08 Compare December 19, 2021 17:40

DrahtBot removed the Needs rebase label Dec 19, 2021

DrahtBot mentioned this pull request Dec 20, 2021

Safegcd-based modular inverses in MuHash3072 #21590

Merged

maflcko removed Build system Tests Utils/log/libs labels Dec 20, 2021

DrahtBot added the Utils/log/libs label Dec 20, 2021

fjahr force-pushed the csi-4-muhash-asm branch from 915ef08 to 82d4c7e Compare November 20, 2022 00:10

theStack reviewed Nov 29, 2022

theStack reviewed Dec 1, 2022

achow101 requested review from theStack and sipa September 20, 2023 17:31

DrahtBot added the CI failed label Nov 27, 2023

DrahtBot removed the CI failed label Dec 10, 2023

DrahtBot added the CI failed label Jan 24, 2024

achow101 closed this Apr 9, 2024

achow101 added the Up for grabs label Apr 9, 2024

maflcko removed the Up for grabs label Apr 9, 2024

bitcoin locked and limited conversation to collaborators Apr 9, 2025
