-
Notifications
You must be signed in to change notification settings - Fork 608
chore: update to mimalloc 3.0.3 beta #7786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
!bench |
Mathlib CI status (docs):
|
!bench |
Here are the benchmark results for commit 5e45ad6. Benchmark Metric Change
==========================================================================
+ Init.Data.BitVec.Lemmas branches -1.2% (-96.3 σ)
+ Init.Data.BitVec.Lemmas maxrss -7.2% (-11.9 σ)
+ Init.Data.BitVec.Lemmas re-elab branches -1.1% (-53.5 σ)
+ Init.Data.BitVec.Lemmas re-elab maxrss -7.7% (-685.7 σ)
+ Init.Data.List.Sublist async branches -1.3% (-150.0 σ)
+ Init.Data.List.Sublist async maxrss -12.1% (-29.5 σ)
+ Init.Data.List.Sublist re-elab -j4 branches -1.8% (-15.9 σ)
+ Init.Data.List.Sublist re-elab -j4 maxrss -35.3% (-64.7 σ)
+ Init.Prelude async maxrss -8.7% (-22.5 σ)
+ Std.Data.DHashMap.Internal.RawLemmas branches -1.8% (-559.2 σ)
+ Std.Data.DHashMap.Internal.RawLemmas maxrss -4.5% (-11.1 σ)
+ Std.Data.Internal.List.Associative branches -1.8% (-91.1 σ)
- big_do instructions 1.5% (84.0 σ)
+ big_do maxrss -6.3% (-74.1 σ)
+ big_omega.lean maxrss -1.8% (-100.3 σ)
- big_omega.lean task-clock 5.3% (10.1 σ)
- big_omega.lean wall-clock 5.4% (13.0 σ)
+ big_omega.lean MT maxrss -1.6% (-363.5 σ)
- big_omega.lean MT task-clock 5.1% (12.6 σ)
- big_omega.lean MT wall-clock 5.1% (10.8 σ)
+ binarytrees instructions -2.3% (-870.4 σ)
+ binarytrees maxrss -38.9% (-348.6 σ)
+ binarytrees.st instructions -1.5% (-492.2 σ)
+ binarytrees.st maxrss -1.0%
- bv_decide_inequality.lean instructions 1.3% (14.7 σ)
- bv_decide_inequality.lean maxrss 5.3% (38.4 σ)
- bv_decide_large_aig.lean branch-misses 44.1% (41.1 σ)
- bv_decide_large_aig.lean branches 1.7% (16.9 σ)
- bv_decide_large_aig.lean instructions 2.5% (24.8 σ)
+ bv_decide_large_aig.lean maxrss -10.4% (-37.4 σ)
- bv_decide_mul maxrss 1.5% (33.2 σ)
+ bv_decide_realworld maxrss -12.1% (-24.2 σ)
- bv_decide_realworld task-clock 4.8% (11.3 σ)
- bv_decide_realworld wall-clock 4.5% (14.6 σ)
- const_fold instructions 23.9% (175.2 σ)
- const_fold task-clock 26.5% (24.0 σ)
- const_fold wall-clock 26.5% (23.9 σ)
- deriv instructions 36.0% (488.4 σ)
- deriv task-clock 52.7% (22.0 σ)
- deriv wall-clock 52.6% (22.0 σ)
- identifier auto-completion bran
8000
ches 4.1% (78.2 σ)
- identifier auto-completion instructions 5.2% (90.7 σ)
+ identifier auto-completion maxrss -4.2% (-55.6 σ)
- identifier auto-completion task-clock 10.7% (16.8 σ)
- ilean roundtrip branches 1.1% (30.2 σ)
- ilean roundtrip compress 11.2% (80.1 σ)
- ilean roundtrip instructions 2.6% (79.0 σ)
- ilean roundtrip task-clock 14.5% (14.1 σ)
- ilean roundtrip wall-clock 14.5% (14.2 σ)
- import Lean branches 5.1% (136.6 σ)
- import Lean instructions 6.3% (190.8 σ)
+ import Lean maxrss -1.1% (-14.4 σ)
- lake build clean instructions 4.5% (118.4 σ)
+ lake build clean maxrss -13.5% (-139.3 σ)
- lake build no-op instructions 2.5% (35.2 σ)
+ lake build no-op maxrss -7.5% (-95.1 σ)
- lake config elab instructions 2.5% (494.6 σ)
+ lake config elab maxrss -2.1% (-60.1 σ)
- lake config import instructions 6.4% (260.0 σ)
- lake config tree instructions 6.2% (167.7 σ)
- lake env instructions 6.6% (312.1 σ)
- lake env task-clock 20.8% (20.9 σ)
- lake env wall-clock 20.8% (21.0 σ)
- lake startup instructions 5.6% (114.5 σ)
- language server startup branches 3.8% (45.5 σ)
- language server startup instructions 4.9% (51.1 σ)
+ language server startup maxrss -4.1% (-121.9 σ)
+ nat_repr branches -2.4% (-450.5 σ)
+ nat_repr instructions -1.3% (-212.4 σ)
- omega_stress.lean async instructions 1.2% (45.7 σ)
+ omega_stress.lean async maxrss -6.1% (-32.2 σ)
- rbmap instructions 2.6% (384.3 σ)
+ rbmap maxrss -1.6%
- rbmap_1 instructions 57.8% (632.1 σ)
- rbmap_1 task-clock 50.2% (21.4 σ)
- rbmap_1 wall-clock 50.1% (21.5 σ)
- rbmap_10 instructions 18.6% (585.6 σ)
- rbmap_10 task-clock 28.5% (11.3 σ)
- rbmap_10 wall-clock 28.5% (11.2 σ)
- rbmap_fbip instructions 1.7% (149.9 σ)
+ rbmap_fbip maxrss -1.4% (-10.1 σ)
- rbmap_library instructions 3.0% (397.3 σ)
- reduceMatch instructions 4.1% (67.0 σ)
+ reduceMatch maxrss -9.7% (-139.8 σ)
- simp_arith1 branches 5.6% (459.8 σ)
- simp_arith1 instructions 6.3% (578.6 σ)
+ simp_arith1 maxrss -1.7% (-79.7 σ)
+ stdlib maxrss -3.5% (-59.1 σ)
+ tests/bench/ interpreted maxrss -2.0% (-107.5 σ)
+ unionfind instructions -1.7% (-16575.7 σ)
+ unionfind maxrss -14.3% (-143.0 σ)
- workspaceSymbols instructions 2.1% (277.8 σ)
+ workspaceSymbols maxrss -2.8% (-86.5 σ) |
@daanx I thought you might be interested in this data! The "Init.Data.List.Sublist re-elab -j4 maxrss -35.3%" is extremely encouraging for longer-running processes of our language server, though it looks like benchmarks that effectively spend most of their time in the allocator do see some significant regressions. |
Hi Sebastian! Hope you are doing well (and we should catch up some time! come to PLDI this year :-)). On my benchmark suite v3 is close to v2 in absolute performance as well -- but when I see the numbers above there is some serious regression on rbmap! Yikes. I wonder though if it is a matter of a bit more tuning -- like turning off purging or reducing page search etc. Is there a somewhat easy way for me to run (some of) these benchmarks and play with different settings for mimalloc? It would be great to figure out what is the cause of this. |
Just to add: what does "instructions" mean? Did mimalloc execute 57.8% more instructions for |
I might just do that :) . Looking forward to seeing you again!
This is CPU instructions as per
You can run the benchmark this way without building Lean from source, via elan:
That doesn't give you any control over mimalloc compile-time options of course. The code embedding it is here: lean4/src/runtime/CMakeLists.txt Lines 7 to 16 in bb6bfdb
static.c directly. I hope I'm not missing some important dev3 flags this way?
|
Great -- see you then!
Ah, that is interesting; that routine is the main new part that does more fine-grained memory sharing. I'll look into it more. (it might just be that I should optimize the initial allocation in pristine reserved memory) Weird to see the
Yes, that is a great way to build it as part of anothor project/runtime. Koka does this too. I don't think you would miss any flags (but ensure you defined |
…ially address perf regresion in mimalloc v3 with respect to v2 -- see also leanprover/lean4#7786
@Kha : I may have fixed some of the perf regression you saw in these tests (in particular |
!bench |
Here are the benchmark results for commit 484e3bc. |
!bench |
Here are the benchmark results for commit 8f92420. Benchmark Metric Change
========================================================================
+ Init.Data.BitVec.Lemmas maxrss -17.2% (-235.1 σ)
+ Init.Data.BitVec.Lemmas re-elab maxrss -7.0% (-44.0 σ)
+ Init.Data.List.Sublist async maxrss -23.5% (-37.2 σ)
+ Init.Data.List.Sublist re-elab -j4 maxrss -28.3% (-25.8 σ)
+ Init.Prelude async maxrss -18.0% (-95.6 σ)
+ Std.Data.DHashMap.Internal.RawLemmas maxrss -8.6% (-34.9 σ)
+ Std.Data.Internal.List.Associative maxrss -3.5% (-38.1 σ)
- big_do branches 1.1% (104.5 σ)
- big_do instructions 1.4% (197.4 σ)
+ big_do maxrss -3.4% (-97.8 σ)
+ big_omega.lean maxrss -2.0% (-27.9 σ)
- big_omega.lean task-clock 7.3% (217.7 σ)
- big_omega.lean wall-clock 7.5% (40.9 σ)
+ big_omega.lean MT maxrss -2.3% (-53.1 σ)
+ binarytrees instructions -1.3% (-29.7 σ)
- bv_decide_inequality.lean maxrss 2.0% (23.5 σ)
- bv_decide_large_aig.lean branch-misses 62.6% (385.6 σ)
- bv_decide_large_aig.lean branches 3.4% (87.3 σ)
- bv_decide_large_aig.lean instructions 3.7% (142.2 σ)
+ bv_decide_large_aig.lean maxrss -8.3% (-424.1 σ)
- bv_decide_mul maxrss 1.0% (20.4 σ)
+ channel.lean maxrss -24.3% (-24.9 σ)
- const_fold instructions 8.4% (55.4 σ)
- deriv instructions 10.3% (88.4 σ)
- identifier auto-completion branches 3.0% (25.5 σ)
- identifier auto-completion instructions 3.6% (27.1 σ)
+ identifier auto-completion maxrss -4.2% (-30.9 σ)
- identifier auto-completion task-clock 9.5% (50.1 σ)
- ilean roundtrip instructions 3.0% (24.2 σ)
- import Lean branches 5.9% (114.1 σ)
- import Lean instructions 6.8% (132.6 σ)
+ import Lean maxrss -1.4% (-53.5 σ)
- lake build clean instructions 5.5% (347.0 σ)
+ lake build clean maxrss -16.6% (-262.8 σ)
+ lake build no-op maxrss -14.5% (-34.1 σ)
- lake config elab instructions 4.1% (137.6 σ)
+ lake config elab maxrss -2.7% (-33.6 σ)
- lake config elab task-clock 8.7% (63.0 σ)
- lake config elab wall-clock 8.6% (55.6 σ)
- lake config import instructions 7.3% (90.9 σ)
+ lake config import maxrss -1.1% (-26.5 σ)
- lake config import task-clock 17.4% (60.5 σ)
- lake config import wall-clock 17.3% (60.0 σ)
- lake config tree instructions 6.9% (160.5 σ)
+ lake config tree maxrss -1.2% (-21.1 σ)
- lake config tree task-clock 13.9% (220.2 σ)
- lake config tree wall-clock 13.8% (191.6 σ)
- lake env instructions 7.4% (135.4 σ)
- lake env task-clock 13.4% (46.2 σ)
- lake env wall-clock 13.2% (50.9 σ)
- lake startup instructions 5.9% (247.5 σ)
- language server startup branches 4.7% (213.6 σ)
- language server startup instructions 5.4% (223.7 σ)
+ language server startup maxrss -3.8% (-110.6 σ)
+ omega_stress.lean async maxrss -9.5% (-61.8 σ)
- rbmap instructions 2.7% (269.5 σ)
- rbmap_1 instructions 20.4% (227.2 σ)
- rbmap_1 task-clock 42.3% (75.9 σ)
- rbmap_1 wall-clock 42.3% (76.3 σ)
- rbmap_10 instructions 7.7% (109.3 σ)
- rbmap_10 task-clock 29.9% (31.2 σ)
- rbmap_10 wall-clock 29.9% (31.1 σ)
- rbmap_fbip instructions 3.4% (1806.2 σ)
- rbmap_library instructions 3.1% (468.4 σ)
- rbmap_library task-clock 19.8% (48.0 σ)
- rbmap_library wall-clock 19.7% (53.5 σ)
+ reduceMatch maxrss -12.8% (-26.1 σ)
- simp_arith1 branches 3.8% (224.0 σ)
- simp_arith1 instructions 4.3% (379.0 σ)
+ simp_arith1 maxrss -2.3% (-156.0 σ)
+ stdlib maxrss -1.3% (-20.9 σ)
- stdlib task-clock 2.5% (96.5 σ)
+ tests/bench/ interpreted maxrss -2.3% (-157.9 σ)
+ unionfind maxrss -14.5% (-83.4 σ)
+ workspaceSymbols maxrss -2.8% (-37.4 σ)
+ workspaceSymbols task-clock -5.0% (-23.8 σ)
+ workspaceSymbols wall-clock -5.0% (-23.5 σ) |
Hmm, that did help with rbmap_1 instructions, but otherwise a bit mixed? I updated the base branch as well to get in new benchmarks but I can rerun the previous mimalloc version on it if a direct comparison would be helpful! |
…3.1.4 Daan (53): bump version to 1.9.4 for further development update readme fix release build warning (unused mi_stat_total_print) fix dynamic override test on non-windows platforms fix invalid pointer detection in release mode (issue #1051 and #1053) fix dynamic override test on non-windows platforms make dynamic override test verbose nicer cmake logic for windows override test add comment fix alpine compilation with prctl.h (issue #1059) typo fix prctl.h includes for alpine linux/musl (hopefully fixes #1065, #1066, #1067) validate pointer before assertion in mi_free_size (issue #754) fix assertion in mi_free_size (issue #754) fix cast on msvc fix mi_ctz/clz/bsf/bsr to avoid msvc compilation bug (issue #1071, pr #1081) revert use of selectany for msvc (issue #1078) add fixed TLS slot test to pipeline on Windows fix debug assertion for windows TLS make windows fixed TLS opt-in fix syntax error fix stats for pages and page_bins merge 3301ba0 to fix page_bin and pages statistics ensure page stats are done on the subproc stats in case the tld is NULL merge 3b567b1 to track pages statistics correctly fix merge conflict fix output for api test fix api test for 32-bits fix format specifier for numa nodes fix format specifier in stat output fix missing csize assignment in _mi_os_free_ex check all os_commit calls and return NULL on failure fix base address if commit fails on aligned overallocation fix missing csize assignment in _mi_os_free_ex check all os_commit calls and return NULL on failure fix base address if commit fails on aligned overallocation fix windows 32-bit compilation fix merge error check return value of mi_arena_commit, potential fix for #1098 use main instead of master in readme add initial support for _mi_prim_reuse and MADV_FREE_REUSABLE on macOS (issue #1097) add MI_UNUSED for unix _mi_prim_reuse fall back to MADV_DONTNEED if MADV_FREE_REUSABLE fails on macOS; disable use of MADV_FREE_REUSE on a reset (issue #1097) add _mi_os_zalloc fix assertion in mi_os_ensure_zero fix assertion improve handling of commit failure in arena allocation make macOS interposes compile for older macOS versions (by @Noxybot, PR #1028) fix link error without static library build (by @fd00, PR #1082) fix build for TSAN tests fix armv7 detection update vcpkg config to 1.9.4 update readme for upcoming release Daan Leijen (3): change default page_reclaim_max and change reclamation test to potentially address perf regresion in mimalloc v3 with respect to v2 -- see also leanprover/lean4#7786 add mi_process_done to the api do not automatically call mi_process_done if mi_option_destroy_on_exit > 1 Eduard Voronkin (1): fix recursion in TLS init on Android daanx (41): nicer show arenas fix build error on msvc/clang-cl in C mode (issue #1060) refactor numa_node_count add atomic_cas_ptr_strong_acq_rel fix signed compare warning use C++ compilation with clang-cl (as well as msvc) on Windows fix typo allow size==0 for mi_prim_free (issue #1041) extend override test on windows Use second user TLS slot to avoid using reserved fields in the TEB (issue #1078) fix use of mi_bsr add unix large page size constant and adjust aligment to the large page size for large allocations improve TLS access on Windows with msvc (by Frank Richter, issue #1078) add more decl_hidden specifiers on extern variables to improve access on arm64 improve TLS access on Windows with msvc (by Frank Richter, issue #1078) add more decl_hidden specifiers on extern variables to improve access on arm64 use TlsAlloc with a dynamic offset for MI_WIN_USE_FIXED_TLS by default (issue #1078) fix MI_WIN_USE_FIXED_TLS conditions fix TLS initialization for MI_WIN_USE_FIXED_TLS with redirection destroy the page map as well if MIMALLOC_DESTROY_ON_EXIT is set; see issue #1041 add page_cross_thread_max_reclaim option potential fix for low address allocations, issue #1087 add alpine x86 docker file fix include of prctl.h on alpine linux x86 fix format specifier (for alpine linux x86, issue #1086) fix guarded sample rate of 1 (issue #1085) add guarded TLS test for Windows fixed TLS refine test for allowing page reclaim on free to limit reclaim if the originating heap is part of a threadpool use proper atomics for the page_map fix valid pointer check for low addresses (#issue 1087) always use non-flat map in secure mode and validate pointers passed to free don't reset stats at process start so page_map commit is included get Windows version dynamically add _mi_stats_init to initialize the process start time fix OS allocation size tracking in the memid fix page map resolving for low addresses (issue #1087) mark assert_fail as cold and noreturn; move assert to internal.h (see issue #1091, and python/cpython#134586) define mi_decl_align separate from mi_decl_cache_align ensure use of subproc_main during unsafe destroy to avoid derefercing the heap use subproc_main for final statistics output to avoid dereferencing the heap enable building for xbox, based on pr #1084 by @maxbachmann
Hi @Kha -- can you run the benchmarks again but with tag |
!bench |
Here are the benchmark results for commit 8d9b69c. Benchmark Metric Change
=========================================================================
+ Init.Data.BitVec.Lemmas maxrss -17.9% (-43.5 σ)
+ Init.Data.BitVec.Lemmas re-elab maxrss -6.9% (-32.2 σ)
+ Init.Data.List.Sublist async maxrss -23.4% (-50.2 σ)
+ Init.Data.List.Sublist re-elab -j4 maxrss -29.0% (-121.5 σ)
+ Init.Prelude async maxrss -18.3% (-42.4 σ)
+ Std.Data.DHashMap.Internal.RawLemmas maxrss -8.9% (-27.6 σ)
+ Std.Data.Internal.List.Associative maxrss -3.5% (-22.8 σ)
- big_do branches 1.1% (131.4 σ)
- big_do instructions 1.4% (187.3 σ)
+ big_omega.lean maxrss -2.0% (-54.1 σ)
+ big_omega.lean MT maxrss -2.4% (-27.0 σ)
+ binarytrees instructions -1.3% (-1769.7 σ)
+ binarytrees maxrss -38.8% (-228.0 σ)
- bv_decide_inequality.lean maxrss 2.0% (30.0 σ)
- bv_decide_large_aig.lean branch-misses 62.5% (156.1 σ)
- bv_decide_large_aig.lean branches 3.4% (111.5 σ)
- bv_decide_large_aig.lean instructions 3.7% (157.1 σ)
- bv_decide_mul maxrss 1.0% (77.8 σ)
- channel.lean bounded0_spsc 4.7% (26.6 σ)
+ channel.lean boundedn_spsc -11.0% (-24.6 σ)
+ channel.lean maxrss -24.3% (-34.6 σ)
- channel.lean unbounded_seq 9.8% (23.0 σ)
- const_fold instructions 8.4% (64.6 σ)
- deriv instructions 10.4% (185.3 σ)
- identifier auto-completion branches 3.2% (23.9 σ)
- identifier auto-completion instructions 3.8% (24.4 σ)
+ identifier auto-completion maxrss -4.2% (-56.9 σ)
- identifier auto-completion task-clock 9.0% (28.0 σ)
- identifier auto-completion wall-clock 12.5% (26.7 σ)
- ilean roundtrip branches 2.4% (99.7 σ)
- ilean roundtrip instructions 3.1% (105.2 σ)
- ilean roundtrip parse 15.7% (34.3 σ)
- ilean roundtrip task-clock 18.3% (25.8 σ)
- ilean roundtrip wall-clock 18.3% (25.9 σ)
- import Lean branches 5.8% (376.3 σ)
- import Lean instructions 6.7% (1090.2 σ)
+ import Lean maxrss -1.4% (-53.2 σ)
- lake build clean instructions 5.5% (471.3 σ)
+ lake build clean maxrss -16.5% (-171.3 σ)
+ lake build no-op maxrss -14.3% (-48.8 σ)
- lake config elab instructions 4.1% (114.2 σ)
+ lake config elab maxrss -2.6% (-64.5 σ)
- lake config import instructions 7.3% (166.4 σ)
- lake config tree instructions 6.9% (1506.2 σ)
+ lake config tree maxrss -1.2% (-20.2 σ)
- lake env instructions 7.4% (150.7 σ)
+ lake env maxrss -1.2% (-39.5 σ)
- lake startup instructions 5.8% (136.4 σ)
- language server startup branches 4.6% (79.5 σ)
- language server startup instructions 5.4% (85.1 σ)
+ language server startup maxrss -3.8% (-36.2 σ)
- omega_stress.lean async branch-misses 7.4% (135.8 σ)
+ omega_stress.lean async maxrss -9.5% (-50.0 σ)
- rbmap instructions 2.7% (183.8 σ)
- rbmap_1 instructions 20.4% (216.0 σ)
- rbmap_1 task-clock 41.5% (20.3 σ)
- rbmap_1 wall-clock 41.5% (20.3 σ)
- rbmap_10 instructions 7.7% (198.6 σ)
- rbmap_fbip instructions 3.5% (331.1 σ)
- rbmap_fbip task-clock 18.4% (34.2 σ)
- rbmap_fbip wall-clock 18.4% (32.3 σ)
- rbmap_library instructions 3.1% (605.6 σ)
+ reduceMatch maxrss -13.1% (-105.2 σ)
+ riscv-ast.lean maxrss -3.3% (-22.8 σ)
- simp_arith1 branches 3.8% (91.9 σ)
- simp_arith1 instructions 4.3% (119.7 σ)
+ simp_arith1 maxrss -2.3% (-39.1 σ)
+ stdlib maxrss -1.3% (-33.0 σ)
- stdlib wall-clock 2.1% (25.7 σ)
+ tests/bench/ interpreted maxrss -2.4% (-161.5 σ)
- tests/bench/ interpreted task-clock 3.0% (24.8 σ)
+ unionfind maxrss -14.4% (-143.3 σ)
+ workspaceSymbols maxrss -2.8% (-52.9 σ) |
For testing, not intended to be merged before a stable release of mimalloc 3