fix(mm): reintroduce explicit virtual to physical address translation for device memory #1815
Merged
Benchmark Results
This comment was automatically generated by github-action-benchmark.
Misc
Benchmark | Current: 1f191b0 | Previous: b5b5c19 | Performance Ratio |
---|---|---|---|
micro_benchmarks Build Time | 74.81 s | 74.72 s | 1.00 |
micro_benchmarks File Size | 0.97 MB | 0.97 MB | 1.00 |
Scheduling time - 1 thread | 66.33 ticks (±2.44 ticks) | 66.83 ticks (±3.10 ticks) | 0.99 |
Scheduling time - 2 threads | 34.48 ticks (±1.50 ticks) | 35.64 ticks (±3.40 ticks) | 0.97 |
Micro - Time for syscall (getpid) | 16.14 ticks (±1.18 ticks) | 15.71 ticks (±1.36 ticks) | 1.03 |
Memcpy speed - (built_in) block size 4096 | 74027.79 MByte/s (±51036.26 MByte/s) | 73338.41 MByte/s (±50542.56 MByte/s) | 1.01 |
Memcpy speed - (built_in) block size 1048576 | 41178.80 MByte/s (±28611.10 MByte/s) | 41237.34 MByte/s (±28631.07 MByte/s) | 1.00 |
Memcpy speed - (built_in) block size 16777216 | 25742.90 MByte/s (±20751.74 MByte/s) | 26159.73 MByte/s (±21200.42 MByte/s) | 0.98 |
Memset speed - (built_in) block size 4096 | 74199.08 MByte/s (±51151.48 MByte/s) | 73371.89 MByte/s (±50563.99 MByte/s) | 1.01 |
Memset speed - (built_in) block size 1048576 | 41460.06 MByte/s (±28810.35 MByte/s) | 41509.80 MByte/s (±28815.09 MByte/s) | 1.00 |
Memset speed - (built_in) block size 16777216 | 26386.40 MByte/s (±21135.66 MByte/s) | 26803.32 MByte/s (±21575.74 MByte/s) | 0.98 |
Memcpy speed - (rust) block size 4096 | 62001.44 MByte/s (±43372.14 MByte/s) | 66444.83 MByte/s (±46279.47 MByte/s) | 0.93 |
Memcpy speed - (rust) block size 1048576 | 40883.33 MByte/s (±28399.62 MByte/s) | 41301.42 MByte/s (±28653.77 MByte/s) | 0.99 |
Memcpy speed - (rust) block size 16777216 | 25689.12 MByte/s (±20708.36 MByte/s) | 26198.19 MByte/s (±21210.46 MByte/s) | 0.98 |
Memset speed - (rust) block size 4096 | 62710.83 MByte/s (±43829.52 MByte/s) | 66823.77 MByte/s (±46541.34 MByte/s) | 0.94 |
Memset speed - (rust) block size 1048576 | 41121.21 MByte/s (±28560.50 MByte/s) | 41550.11 MByte/s (±28821.87 MByte/s) | 0.99 |
Memset speed - (rust) block size 16777216 | 26318.36 MByte/s (±21079.12 MByte/s) | 26858.59 MByte/s (±21598.51 MByte/s) | 0.98 |
alloc_benchmarks Build Time | 74.58 s | 72.59 s | 1.03 |
alloc_benchmarks File Size | 0.92 MB | 0.92 MB | 1.00 |
Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
Allocations - Deallocation success | 70.03 % (±0.26 %) | 70.01 % (±0.26 %) | 1.00 |
Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
Allocations - Average Allocation time | 11030.05 Ticks (±188.57 Ticks) | 11052.09 Ticks (±196.08 Ticks) | 1.00 |
Allocations - Average Allocation time (no fail) | 11030.05 Ticks (±188.57 Ticks) | 11052.09 Ticks (±196.08 Ticks) | 1.00 |
Allocations - Average Deallocation time | 818.63 Ticks (±15.74 Ticks) | 833.82 Ticks (±17.80 Ticks) | 0.98 |
mutex_benchmark Build Time | 73.99 s | 73.82 s | 1.00 |
mutex_benchmark File Size | 0.97 MB | 0.97 MB | 1.00 |
Mutex Stress Test Average Time per Iteration - 1 Threads | 14.06 ns (±0.54 ns) | 14.18 ns (±0.65 ns) | 0.99 |
Mutex Stress Test Average Time per Iteration - 2 Threads | 16.56 ns (±1.60 ns) | 16.92 ns (±1.06 ns) | 0.98 |
Misc
Benchmark | Current: 1f191b0 | Previous: 885734c | Performance Ratio |
---|---|---|---|
micro_benchmarks Build Time | 75.17 s | 92.18 s | 0.82 |
micro_benchmarks File Size | 0.97 MB | 0.97 MB | 1.00 |
Scheduling time - 1 thread | 67.77 ticks (±2.86 ticks) | 67.38 ticks (±3.72 ticks) | 1.01 |
Scheduling time - 2 threads | 36.34 ticks (±2.05 ticks) | 34.93 ticks (±1.69 ticks) | 1.04 |
Micro - Time for syscall (getpid) | 16.09 ticks (±1.56 ticks) | 15.86 ticks (±1.09 ticks) | 1.01 |
Memcpy speed - (built_in) block size 4096 | 73392.17 MByte/s (±50748.98 MByte/s) | 72968.84 MByte/s (±50564.82 MByte/s) | 1.01 |
Memcpy speed - (built_in) block size 1048576 | 40972.89 MByte/s (±28537.54 MByte/s) | 41555.93 MByte/s (±28865.47 MByte/s) | 0.99 |
Memcpy speed - (built_in) block size 16777216 | 26388.97 MByte/s (±21907.17 MByte/s) | 26104.96 MByte/s (±21970.37 MByte/s) | 1.01 |
Memset speed - (built_in) block size 4096 | 73439.26 MByte/s (±50782.33 MByte/s) | 72990.31 MByte/s (±50579.80 MByte/s) | 1.01 |
Memset speed - (built_in) block size 1048576 | 41213.39 MByte/s (±28699.74 MByte/s) | 41811.90 MByte/s (±29042.09 MByte/s) | 0.99 |
Memset speed - (built_in) block size 16777216 | 27055.54 MByte/s (±22283.63 MByte/s) | 26890.08 MByte/s (±22397.84 MByte/s) | 1.01 |
Memcpy speed - (rust) block size 4096 | 61712.80 MByte/s (±42993.06 MByte/s) | 64378.81 MByte/s (±44953.84 MByte/s) | 0.96 |
Memcpy speed - (rust) block size 1048576 | 41212.17 MByte/s (±28649.36 MByte/s) | 41277.27 MByte/s (±28662.33 MByte/s) | 1.00 |
Memcpy speed - (rust) block size 16777216 | 25937.96 MByte/s (±21466.33 MByte/s) | 26660.93 MByte/s (±22238.57 MByte/s) | 0.97 |
Memset speed - (rust) block size 4096 | 61924.57 MByte/s (±43141.43 MByte/s) | 64487.90 MByte/s (±45017.04 MByte/s) | 0.96 |
Memset speed - (rust) block size 1048576 | 41472.32 MByte/s (±28826.55 MByte/s) | 41532.86 MByte/s (±28836.12 MByte/s) | 1.00 |
Memset speed - (rust) block size 16777216 | 26582.99 MByte/s (±21833.10 MByte/s) | 27388.51 MByte/s (±22664.91 MByte/s) | 0.97 |
alloc_benchmarks Build Time | 74.80 s | 88.27 s | 0.85 |
alloc_benchmarks File Size | 0.92 MB | 0.92 MB | 1.00 |
Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
Allocations - Deallocation success | 69.99 % (±0.30 %) | 69.97 % (±0.35 %) | 1.00 |
Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
Allocations - Average Allocation time | 14591.74 Ticks (±292.45 Ticks) | 13449.07 Ticks (±258.41 Ticks) | 1.08 |
Allocations - Average Allocation time (no fail) | 14591.74 Ticks (±292.45 Ticks) | 13449.07 Ticks (±258.41 Ticks) | 1.08 |
Allocations - Average Deallocation time | 1119.40 Ticks (±252.37 Ticks) | 852.92 Ticks (±69.83 Ticks) | 1.31 |
mutex_benchmark Build Time | 75.34 s | 91.20 s | 0.83 |
mutex_benchmark File Size | 0.97 MB | 0.97 MB | 1.00 |
Mutex Stress Test Average Time per Iteration - 1 Threads | 14.08 ns (±1.07 ns) | 14.26 ns (±0.48 ns) | 0.99 |
Mutex Stress Test Average Time per Iteration - 2 Threads | 24.14 ns (±15.01 ns) | 20.98 ns (±14.98 ns) | 1.15 |
General
Benchmark | Current: 1f191b0 | Previous: 885734c | Performance Ratio |
---|---|---|---|
startup_benchmark Build Time | 72.41 s | 69.30 s | 1.04 |
startup_benchmark File Size | 0.85 MB | 0.86 MB | 1.00 |
Startup Time - 1 core | 0.99 s (±0.06 s) | 0.93 s (±0.04 s) | 1.07 |
Startup Time - 2 cores | 0.99 s (±0.03 s) | 0.93 s (±0.04 s) | 1.06 |
Startup Time - 4 cores | 0.99 s (±0.03 s) | 0.93 s (±0.04 s) | 1.06 |
multithreaded_benchmark Build Time | 75.05 s | 68.22 s | 1.10 |
multithreaded_benchmark File Size | 0.96 MB | 0.96 MB | 1.00 |
Multithreaded Pi Efficiency - 2 Threads | 89.71 % (±9.87 %) | 86.87 % (±8.71 %) | 1.03 |
Multithreaded Pi Efficiency - 4 Threads | 61.14 % (±6.88 %) | 61.17 % (±6.32 %) | 1.00 |
Multithreaded Pi Efficiency - 8 Threads | 43.44 % (±2.88 %) | 41.25 % (±5.92 %) | 1.05 |
c88bd46
to
6bcb80e
Compare
zyuiop
reviewed
Jul 7, 2025
jounathaen
reviewed
Jul 7, 2025
1bdcaf7
to
6fd2fe2
Compare
24d2cf1
to
48186db
Compare
adc369a
to
707d6f5
Compare
jounathaen
reviewed
Jul 8, 2025
jounathaen
reviewed
Jul 8, 2025
jounathaen
reviewed
Jul 8, 2025
jounathaen
approved these changes
Jul 8, 2025
stlankes
approved these changes
Jul 8, 2025
1f191b0
to
6bd1362
Compare
sarahspberrypi
approved these changes
Jul 9, 2025
Before #1609, #1669, and #1670, we were mapping frames and flushing on each device memory allocation, which was expensive.
Those PRs changed the initial memory mappings by ensuring an identity mapping of all physical memory.
That allowed for (basically) no-op device memory allocation while reducing TLB pressure.
Before #1712, we were walking the page table on every virtual to physical address translation for device communication, which was not very costly but still slower than necessary.
This PR reintroduces explicit virtual to physical address translation when handling device memory, but does so while avoiding the performance pitfalls of the past.
We now have the option to map the complete physical memory a second time at an offset.
This is currently done on `cfg!(careful)` to ensure all devices use the device allocator not only for memory management but also for address translation.
Eventually, this will allow us to mark one of the mappings as private and the other as public, making flexible device communication as cheap as possible.
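
As a rough illustration of why translation through an offset mapping is cheap, here is a minimal sketch. It is not the code from this PR: `PhysOffset`, `phys_to_virt`, and `virt_to_phys` are hypothetical names, and the sketch assumes that the whole of physical memory is mapped contiguously at a single fixed offset. Under that assumption, translation is a single checked addition or subtraction instead of a page table walk.

```rust
/// Hypothetical offset at which all physical memory is mapped a second time.
/// An offset of 0 corresponds to a plain identity mapping.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PhysOffset(pub u64);

impl PhysOffset {
    /// Virtual address through which the memory at `phys` can be accessed.
    /// Returns `None` if the addition would overflow.
    pub fn phys_to_virt(self, phys: u64) -> Option<u64> {
        phys.checked_add(self.0)
    }

    /// Physical address to hand to a device for memory mapped at `virt`.
    /// Returns `None` if `virt` lies below the offset window.
    pub fn virt_to_phys(self, virt: u64) -> Option<u64> {
        virt.checked_sub(self.0)
    }
}
```

With an offset of 0, virtual and physical addresses coincide, so a driver that skips the translation would still appear to work. With a nonzero offset, the same driver would hand the device a wrong address, which is presumably the kind of mistake the second mapping is meant to expose.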