Replace current benchmarking framework with nanobench #18011
Conversation
Strong concept ACK! Seems like a big improvement. Can you comment more on the 6 seconds claim? AFAIK each bench was supposed to target running for 1 second? Is this no longer required to reduce variance? Secondly -- and separately -- can you comment on how this might impact the need for something like #17375? Can we add better support for benchmarks where we want to run with different scaling params and output each trial to get a sense of the complexity?
I calculate a good number of iterations based on the clock accuracy, then perform these iterations a few times and use the median to get rid of outliers. I found it actually to be more reliable with shorter runs, because there is less chance for random fluctuations to interfere. It is necessary though to disable frequency scaling etc. (but this should be done with the old framework too anyway). This can be easily done with e.g. pyperf. Concerning #17375, nanobench can estimate complexity, but it requires a bit of code change: https://github.com/martinus/nanobench/blob/master/docs/reference.md#asymptotic-complexity
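For reference, the linked asymptotic-complexity support looks roughly like this (a sketch following the nanobench reference docs; the `std::set` workload is just an illustration, not a Bitcoin benchmark):

```cpp
#include <nanobench.h>

#include <cstdint>
#include <iostream>
#include <set>

int main() {
    ankerl::nanobench::Bench bench;
    // Run the same benchmark with increasing problem sizes; nanobench
    // records one (complexityN, runtime) pair per run.
    for (uint64_t n = 10; n <= 100000; n *= 10) {
        std::set<uint64_t> s;
        for (uint64_t i = 0; i < n; ++i) s.insert(i);
        bench.complexityN(n).run("std::set find", [&] {
            ankerl::nanobench::doNotOptimizeAway(s.find(n / 2));
        });
    }
    // Fit the measurements against O(1), O(log n), O(n), O(n log n), ...
    // and print the candidate curves ranked by fit quality.
    std::cout << bench.complexityBigO() << std::endl;
}
```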
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
The lint check currently fails with this error:
I believe the reason is some key verification check at the end of ci/lint/06_script.sh, but I can't really say why it is failing.
Concept ACK. The travis failure is:
Would it be easy to hack in CSV output that is somewhat similar to the existing output? The markdown table looks a little tricky to parse programmatically (though it could be done). For example, bitcoinperf (https://github.com/chaincodelabs/bitcoinperf) currently relies on this format.
Ah, ok I'll fix this
I think it should be easy: in nanobench I already have CSV & JSON output formats using mustache-like templates, so it's possible to create a custom format. I have not exposed this feature in this PR yet, though.
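For reference, nanobench's existing template-based output can be driven like this (a minimal sketch using the built-in CSV template; API names per the nanobench docs):

```cpp
#include <nanobench.h>

#include <iostream>

int main() {
    ankerl::nanobench::Bench bench;
    bench.run("example", [] {
        // code under test goes here
    });
    // Render the collected results through the built-in mustache-like
    // CSV template; a custom template string can be passed instead.
    ankerl::nanobench::render(ankerl::nanobench::templates::csv(), bench, std::cout);
}
```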
Strong concept ACK @martinus, thanks for your great contributions! Please keep them coming :)
Does this library also do memory clobbers and barriers (like Google's)? [0] https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L307
I currently have `doNotOptimizeAway`. I don't have `clobberMemory` yet, because I've never used it... What I'm also doing is I force the ...
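A minimal sketch of nanobench's equivalent, `doNotOptimizeAway` (usage per the nanobench README; as noted above, there is no `clobberMemory` counterpart):

```cpp
#include <nanobench.h>

#include <cstdint>

int main() {
    uint64_t x = 1;
    ankerl::nanobench::Bench().run("x += x", [&] {
        // Without this, the compiler could optimize the loop body away,
        // since the result is never otherwise observed.
        ankerl::nanobench::doNotOptimizeAway(x += x);
    });
}
```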
Force-pushed from fbb36d2 to be20853.
Is it possible to make the nanobench include a submodule (like secp256k1 or univalue) so that it's easier for us to pull in updates from upstream? If you plan on adding new features to nanobench, that could help streamline review. If you think that there will be Bitcoin-specific changes made to the header that you wouldn't want to upstream, then I would leave it as you've done.
Closed #17375 in favor of nanobench. When you have time, I would love assistance in making the asymptotic test introduced there nanobench-compatible. Being able to run asymptotic tests on the code is going to be a huge help in advocating for mempool policy changes in the future (e.g., loosening descendants limits) once the epoch mempool work is completed. This also impacts other parts of the code (e.g., wallet) where right now we don't have good insight into whether we introduce a regression. I think the curve fitting is a bit less useful because we do care about the constant factors too (e.g., if we're O(n log n) vs. O(n) but c > log n for max n), but it's quite nifty nonetheless.
I think it should be possible; I need to read up on how git-subtree works... I prefer if nanobench stays generic, and I'll try to implement bitcoin's requirements in a generic way so it's usable by others too. So no separate repository if possible.

Sure, I'll have a look at #17375 when I have time! I can't predict how soon that will be, though.
Haven't looked at the library itself yet, but Concept ACK on replacing the current framework. (I really dislike it)

I don't think google benchmark is viable here. It's a large dependency, and you also need to use the gtest framework for it. It would be quite a big change.

Concept ACK
We discussed nanobench today in the IRC meeting. There seems to be general agreement that this is a nice idea, and that the current bench framework isn't perfect. Our current bench framework is actually based on Google's, and I think most people are opposed to linking google's whole thing. With respect to the question of whether to subtree or not: let's ignore that for now, copied-in is fine, and we can deal with that in the future if we require changes to nanobench or if there are new features in nanobench we want to incorporate. There's some concern about maintaining compatibility with existing tools. I think given that the output is fundamentally different from before (no longer reporting iterations and things like that) we can't have perfect parity. But perhaps we could:

I think most people would be satisfied with 3, as 2 can be done with a script on 3's output and 1 can be done if someone has the itch for it.
Thanks for the summary! I just read through the log here and think I can add a few clarifications:
It's not malware, not sure how I can help here :) I've created nanobench because I was annoyed at how difficult other benchmarking frameworks were to integrate into an existing codebase, and because I don't like google test.
I have templating support in nanobench, and I can relatively easily add another output format that resembles the current output format closely. I need to make some improvements to the available data in nanobench, then I can use a template like this to produce practically the same output as before:
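Based on nanobench's mustache-like tags, such a template could look along these lines (a sketch; the exact field set used in the PR is an assumption):

```cpp
#include <nanobench.h>

#include <iostream>

// A mustache-like template that mimics the old framework's columns.
// Tag names ({{name}}, {{epochs}}, {{median(elapsed)}}, ...) follow the
// nanobench docs; the exact template is an assumption.
static const char* OLD_STYLE_TEMPLATE =
    "# Benchmark, evals, iterations, total, min, max, median\n"
    "{{#result}}{{name}}, {{epochs}}, {{average(iterations)}}, "
    "{{sumProduct(iterations, elapsed)}}, {{minimum(elapsed)}}, "
    "{{maximum(elapsed)}}, {{median(elapsed)}}\n"
    "{{/result}}";

int main() {
    ankerl::nanobench::Bench bench;
    bench.run("example", [] { /* code under test */ });
    ankerl::nanobench::render(OLD_STYLE_TEMPLATE, bench, std::cout);
}
```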
Then I can e.g. print the markdown tables to stdout, and create e.g. CSV output in a separate file.
Google benchmark is quite simple: it has a fixed runtime that it wants to achieve, then finds out the number of iterations it needs to get there, then it measures the time for that. In nanobench I try to be a bit smarter: I find out the clock's accuracy first, and base the target runtime on that. Since clocks are nowadays very accurate (18ns or so on my machine), I can easily perform the measurement multiple times and use the median to get rid of outliers. The fast runtimes give very repeatable measurements for stuff that's deterministic (e.g. SHA hashing). There I believe nanobench has a clear advantage over all other benchmarking frameworks that I've tried. When the code under test has fluctuations (e.g. because it allocates stuff, or has some randomness or lots of cache misses / branch mispredictions in it), nanobench's runtime measurement probably isn't better than google benchmark's. In that case it helps to also show the numbers for branch misses and retired instruction count to get a better feeling.
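To make that strategy concrete, here is an illustrative sketch (not nanobench's actual source) of "derive the target epoch time from the measured clock resolution, then take the median over several epochs":

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using Clock = std::chrono::steady_clock;

// Smallest observable difference between two consecutive clock reads.
static Clock::duration ClockResolution() {
    auto best = Clock::duration::max();
    for (int i = 0; i < 100; ++i) {
        auto a = Clock::now();
        auto b = Clock::now();
        while (b <= a) b = Clock::now();
        best = std::min(best, b - a);
    }
    return best;
}

// Target each epoch at ~1000x the clock resolution, run 11 epochs, and
// report the median time per iteration to suppress outliers.
template <typename Op>
Clock::duration MedianRuntime(Op&& op) {
    const auto target = ClockResolution() * 1000;
    std::vector<Clock::duration> epochs;
    for (int e = 0; e < 11; ++e) {
        int64_t iters = 0;
        const auto start = Clock::now();
        auto now = start;
        do {
            op();
            ++iters;
            now = Clock::now();
        } while (now - start < target);
        epochs.push_back((now - start) / iters);
    }
    std::sort(epochs.begin(), epochs.end());
    return epochs[epochs.size() / 2];
}

int main() {
    volatile uint64_t x = 1;
    const auto med = MedianRuntime([&] { x = x + x; });
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(med).count()
              << " ns/op (approx)\n";
}
```

Note the sketch charges the `Clock::now()` call to every iteration; a real implementation batches many iterations per clock read.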
👍 per my ACK a few months ago #18011 (review)
Thanks for merging! If anyone has any questions or issues with nanobench, please notify me.
…bench 78c312c Replace current benchmarking framework with nanobench (Martin Ankerl)

Pull request description: Replace current benchmarking framework with nanobench (the description is reproduced in full at the end of this thread).

ACKs for top commit: laanwj: ACK 78c312c

Tree-SHA512: 9e18770b18b6f95a7d0105a4a5497d31cf4eb5efe6574f4482f6f1b4c88d7e0946b9a4a1e9e8e6ecbf41a3f2d7571240677dcb45af29a6f0584e89b25f32e49e
@martinus How to conditionally skip a benchmark in the nanobench framework (wrt #19710 (comment) and #19710 (comment))?
You can use the `-filter=<regex>` command line option.
I mean in the code, e.g., skip `CCheckQueueSpeedPrevectorJob` when fewer than two cores are available.
Ah, of course. Before the benchmark is run you can do a check and then simply return, e.g. like so:

```cpp
static void CCheckQueueSpeedPrevectorJob(benchmark::Bench& bench)
{
    if (GetNumCores() < 2) {
        return;
    }
    // ... the actual benchmark code ...
}
```

But that needs a little update where the results are collected, so that benchmarks that returned early without producing a result are skipped:

```cpp
if (!bench.results().empty()) {
    benchmarkResults.push_back(bench.results().back());
}
```
Not sure if this is caused by this pull, but some benchmarks changed performance quite significantly:
Is this due to compiler optimizations or something else? Funnily enough, the Trig dummy bench went up by ~100%: https://codespeed.bitcoinperf.com/timeline/#/?exe=3,4,2,1,5&base=1+23&ben=micro.clang.Trig&env=1&revs=200&equid=off&quarts=on&extr=on
Summary:
```
Backport of core PR18011: Replace current benchmarking framework with nanobench (the full pull request description is reproduced at the end of this thread).

* Adds support for asymptotes

This adds support to calculate the asymptotic complexity of a benchmark. This is similar to #17375, but currently only one asymptote is supported, and I have added support in the benchmark `ComplexMemPool` as an example.

Usage is e.g. like this:

./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800

This runs the benchmark `ComplexMemPool` several times but with different complexityN settings. The benchmark can extract that number and use it accordingly. Here, it's used for `childTxs`.

The output is this:

| complexityN | ns/op | op/s | err% | ins/op | cyc/op | IPC | total | benchmark
|------------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|----------:|:----------
| 25 | 1,064,241.00 | 939.64 | 1.4% | 3,960,279.00 | 2,829,708.00 | 1.400 | 0.01 | `ComplexMemPool`
| 50 | 1,579,530.00 | 633.10 | 1.0% | 6,231,810.00 | 4,412,674.00 | 1.412 | 0.02 | `ComplexMemPool`
| 100 | 4,022,774.00 | 248.58 | 0.6% | 16,544,406.00 | 11,889,535.00 | 1.392 | 0.04 | `ComplexMemPool`
| 200 | 15,390,986.00 | 64.97 | 0.2% | 63,904,254.00 | 47,731,705.00 | 1.339 | 0.17 | `ComplexMemPool`
| 400 | 69,394,711.00 | 14.41 | 0.1% | 272,602,461.00 | 219,014,691.00 | 1.245 | 0.76 | `ComplexMemPool`
| 600 | 168,977,165.00 | 5.92 | 0.1% | 639,108,082.00 | 535,316,887.00 | 1.194 | 1.86 | `ComplexMemPool`
| 800 | 310,109,077.00 | 3.22 | 0.1% | 1,149,134,246.00 | 984,620,812.00 | 1.167 | 3.41 | `ComplexMemPool`

| coefficient | err% | complexity
|--------------:|-------:|------------
| 4.78486e-07 | 4.5% | O(n^2)
| 6.38557e-10 | 21.7% | O(n^3)
| 3.42338e-05 | 38.0% | O(n log n)
| 0.000313914 | 46.9% | O(n)
| 0.0129823 | 114.4% | O(log n)
| 0.0815055 | 133.8% | O(1)

The best fitting curve is O(n^2), so the algorithm seems to scale quadratically with `childTxs` in the range 25 to 800.
```

Backport of core [[bitcoin/bitcoin#18011 | PR18011]].

Test Plan: ninja bench-bitcoin

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Subscribers: PiRK

Differential Revision: https://reviews.bitcoinabc.org/D9226
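For illustration, a benchmark picks up the `-asymptote` value via `Bench::complexityN()`; a minimal sketch along the lines of the `ComplexMemPool` example (the mempool setup itself is elided, and the exact surrounding code is an assumption):

```cpp
static void ComplexMemPool(benchmark::Bench& bench)
{
    // Default problem size; overridden when -asymptote=... is passed,
    // in which case the framework runs this once per listed value.
    int childTxs = 800;
    if (bench.complexityN() > 1) {
        childTxs = static_cast<int>(bench.complexityN());
    }
    // ... build a mempool workload with childTxs child transactions,
    // then run the code under test:
    // bench.run([&] { /* add/remove the transactions */ });
}
```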
…bench 78c312c Replace current benchmarking framework with nanobench (Martin Ankerl)

Pull request description: Replace current benchmarking framework with nanobench (same merge as above; the description is reproduced in full at the end of this thread).

# Conflicts:
# .appveyor.yml
# contrib/devtools/copyright_header.py
# doc/benchmarking.md
# src/Makefile.bench.include
# src/Makefile.test.include
# src/bench/addrman.cpp
# src/bench/base58.cpp
# src/bench/bech32.cpp
# src/bench/bench.cpp
# src/bench/bench.h
# src/bench/bench_bitcoin.cpp
# src/bench/block_assemble.cpp
# src/bench/ccoins_caching.cpp
# src/bench/checkblock.cpp
# src/bench/coin_selection.cpp
# src/bench/crypto_hash.cpp
# src/bench/duplicate_inputs.cpp
# src/bench/examples.cpp
# src/bench/gcs_filter.cpp
# src/bench/hashpadding.cpp
# src/bench/mempool_eviction.cpp
# src/bench/mempool_stress.cpp
# src/bench/prevector.cpp
# src/bench/rollingbloom.cpp
# src/bench/rpc_blockchain.cpp
# src/bench/rpc_mempool.cpp
# src/bench/verify_script.cpp
# src/bench/wallet_balance.cpp
# test/lint/lint-include-guards.sh
partial merge bitcoin#18011: Replace current benchmarking framework with nanobench
Replace current benchmarking framework with nanobench
This replaces the current benchmarking framework with nanobench [1], an MIT licensed single-header benchmarking library, of which I am the author. This has in my opinion several advantages, especially on Linux:

* fast: Running all benchmarks takes ~6 seconds instead of 4m13s on an Intel i7-8700 CPU @ 3.20GHz.
* accurate: I ran e.g. the benchmark for SipHash_32b 10 times and calculated standard deviation / mean = coefficient of variation:
  * 0.57% CV for the old benchmarking framework
  * 0.20% CV for nanobench

  So the benchmark results with nanobench seem to vary less than with the old framework.
* It automatically determines runtime based on clock precision, no need to specify the number of evaluations.
* Measures instructions, cycles, branches, instructions per cycle, and branch misses (only on Linux, when performance counters are available).
* Output in markdown table format.
* Warns about an unstable environment (frequency scaling, turbo, ...).
* For better profiling, it is possible to set the environment variable NANOBENCH_ENDLESS to force endless running of a particular benchmark without the need to recompile. This makes it easy to e.g. run "perf top" and look at hotspots.
Here is an example copy & pasted from the terminal output:

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 2.52 | 396,529,415.94 | 0.6% | 25.42 | 8.02 | 3.169 | 0.06 | 0.0% | 0.03 | `bench/crypto_hash.cpp RIPEMD160`
| 1.87 | 535,161,444.83 | 0.3% | 21.36 | 5.95 | 3.589 | 0.06 | 0.0% | 0.02 | `bench/crypto_hash.cpp SHA1`
| 3.22 | 310,344,174.79 | 1.1% | 36.80 | 10.22 | 3.601 | 0.09 | 0.0% | 0.04 | `bench/crypto_hash.cpp SHA256`
| 2.01 | 496,375,796.23 | 0.0% | 18.72 | 6.43 | 2.911 | 0.01 | 1.0% | 0.00 | `bench/crypto_hash.cpp SHA256D64_1024`
| 7.23 | 138,263,519.35 | 0.1% | 82.66 | 23.11 | 3.577 | 1.63 | 0.1% | 0.00 | `bench/crypto_hash.cpp SHA256_32b`
| 3.04 | 328,780,166.40 | 0.3% | 35.82 | 9.69 | 3.696 | 0.03 | 0.0% | 0.03 | `bench/crypto_hash.cpp SHA512`
[1] https://github.com/martinus/nanobench