Fix negative duration panic on macos #475
Conversation
examples/with_winit/src/stats.rs
@@ -386,7 +386,7 @@ pub fn draw_gpu_profiling(
             cur_text_y = text_y + text_height;
             let label = format!(
                 "{:.2?} - {:.30}",
-                Duration::from_secs_f64(this_time),
+                Duration::from_secs_f64(this_time.max(0.0)),
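For context on the crash itself, here is a minimal sketch (not part of this PR; the value of this_time is made up) of why the clamp is needed: Duration::from_secs_f64 panics when given a negative number of seconds, so even a slightly negative GPU timing takes the whole example down.

```rust
use std::time::Duration;

fn main() {
    // A GPU timing that came back slightly negative, as reported in this issue.
    let this_time: f64 = -0.02;

    // This line would panic: `Duration::from_secs_f64` rejects negative input.
    // let _ = Duration::from_secs_f64(this_time);

    // Clamping to zero first, as the diff above does, avoids the panic.
    let clamped = Duration::from_secs_f64(this_time.max(0.0));
    println!("{:.2?}", clamped); // prints "0.00ns"
}
```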
I wonder if this is a bug in wgpu, or something else where a monotonic timer isn't being used ...
Thanks for the PR - I just have one question, which is: have you validated how negative the number is? I ask because if it's very small (…) I believe @DasLixou was seeing something similar on the Zulip, where the timings from the profiler didn't seem right. Although in that case, it was AMD DirectX12 rather than macOS.
Yeah, that suggests there's some definite weirdness going on. The number always being the same is also extremely suspect - that suggests there's something more specific going on. Can you print the byte value of that number (i.e. …)?
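One way to answer that question, as a hedged sketch (the variable name and value are placeholders, not code from the example): dump the raw bit pattern of the f64 before it gets clamped.

```rust
fn main() {
    // Placeholder: in stats.rs this would be the suspicious `this_time` value.
    let this_time: f64 = -0.02;

    // Full IEEE-754 bit pattern, handy for spotting sentinel values or wrap-around.
    println!("bits  = {:#018x}", this_time.to_bits());
    // Individual bytes, in case the byte order itself is interesting.
    println!("bytes = {:02x?}", this_time.to_le_bytes());
}
```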
I think this PR should show the negative value in the visible output (instead of zero). That is, by producing a Duration of the absolute value. It's not clear to me why this is happening - @Wumpf, have you got any ideas?

Edit: To be clear - I'll happily merge this if it doesn't clamp but shows the negative value, perhaps with an …
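A sketch of that suggestion (illustrative only; the helper name is made up and this is not what the PR currently does): format the magnitude as a Duration but keep an explicit sign, so suspicious negative samples stay visible in the overlay instead of being silently clamped to zero.

```rust
use std::time::Duration;

// Hypothetical helper: render a possibly-negative timing without panicking,
// while keeping the sign visible instead of clamping to zero.
fn format_timing(this_time: f64) -> String {
    let sign = if this_time < 0.0 { "-" } else { "" };
    format!("{sign}{:.2?}", Duration::from_secs_f64(this_time.abs()))
}

fn main() {
    println!("{}", format_timing(0.0015));  // "1.50ms"
    println!("{}", format_timing(-0.0015)); // "-1.50ms"
}
```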
No, I did some more digging with the profiler. My other graphics card works fine, but the RX 6600 (now with the latest drivers) somehow misplaces the timing for the scopes (the parent scope sometimes begins 40ms after all the other stuff has already run). This happens for DX12 and Vulkan, so I think it's a bug in the driver giving back the time, but I'm not sure and don't know any further right now.
OK, just read the whole conversation. If this is the same problem as I have, then I think it's a bug between wgpu_profiler and the graphics card. Maybe we could do a bare-bones wgpu render example with wgpu_profiler and see if the problem shows up there too.
I believe this is somewhere between a known & hard-to-avoid issue and problematic barrier placement in wgpu 🤔
That said, this all of course depends very much on the use of timer scopes. I'll see if I can find a bit of time to poke around here to learn more - any particular places in the code that I should look at?
Thanks - at least it's not too unexpected. Is there a wgpu issue we can track? The main use of the scopes is in …
These two fit best (given that it's really the issue I'm suspecting): …
I don't think those explain the really odd behaviour we're seeing. Like, I don't think it's reasonable for a dispatch to finish 20ms before it started. Those issues seem to be discussing how two timestamps can happen arbitrarily close to one another, even if the passes themselves are slow. But I don't think that should mean one timestamp happens before the other?
Since GPU processing of things that aren't protected by barriers can be reordered in execution arbitrarily, it is still very relevant. Right now it can easily happen that one query is part of a block of memory operations and another is part of a dispatch, and then the driver/GPU decides to reorder one after the other (that sort of thing happens a lot on wgpu's Metal implementation because of the way it creates MTLCommandEncoder on the fly). But there could be more to it. It's probably also best to check with RenderDoc / the Xcode frame profiler to see where each barrier and each query ends up.
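Given that reordering, one defensive option (a sketch under the assumption that each scope hands back start/end timestamps in seconds; this is not wgpu_profiler's actual API) is to keep the subtraction signed and flag out-of-order samples, rather than feeding a negative value straight into Duration::from_secs_f64.

```rust
use std::time::Duration;

/// Turn two GPU timestamps (in seconds) into a printable duration, flagging
/// samples where the end query appears to land before the start query.
fn scope_duration(start_secs: f64, end_secs: f64) -> (Duration, bool) {
    let delta = end_secs - start_secs;
    let out_of_order = delta < 0.0;
    // Clamp only for display; the flag records that the sample was suspect.
    (Duration::from_secs_f64(delta.max(0.0)), out_of_order)
}

fn main() {
    // End landed 20ms "before" start, as reported in this thread.
    let (d, suspect) = scope_duration(0.040, 0.020);
    println!("{d:.2?} (out of order: {suspect})"); // "0.00ns (out of order: true)"
}
```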
Thanks. I've also added a comment which explains this.
The Vulkan spec only provides …

    writeTimestamp(A)
    do_some_gpu_work(...)
    writeTimestamp(B)

    // Running via Vulkan, gets reordered on the GPU as:
    do_some_gpu_work(...)
    writeTimestamp(A) // Totally compliant with Vulkan spec
    writeTimestamp(B)

So the quality of measurements will depend on how much the GPU manufacturer went beyond the spec in retaining execution order. Very unfortunate, especially as the reordering can be dynamic and depend on the measured instructions. So measuring a different execution time may just mean that the GPU optimizer resulted in a different execution order.

The issue on macOS is slightly different, because Metal doesn't provide even the already weak options that Vulkan has. Or at least the implementations haven't found those options yet. I didn't find a Metal specific …

    writeTimestamp(A)
    do_some_gpu_work(...)
    writeTimestamp(B)

    // Running via Metal, gets even worse:
    do_some_gpu_work(...)
    writeTimestamp(B)
    writeTimestamp(A) // The joys of the current Dawn implementation

Even though this is a Dawn issue and I didn't find the matching …
In summary, here's where we stand right now: …
Even taking that into account, which I agree definitely permits some negative durations, I still think there's some strangeness. I'm less sure that it permits this negative duration. I'm assuming two fairly weak invariants in making that assertion: …

The 20ms gap is long enough for a complete cycle to have been done, i.e. if everything is working correctly, we're seeing one of those invariants be broken (also because we're blocking on the timestamp data being returned, because of using …).
20ms does seem awfully long for an M2 Pro, so indeed there's probably an additional factor. I checked the Metal timer code in … So we know for a fact that both the …
Excellent writeup, @xStrom!
The winit example sometimes panics when running on macOS. This PR fixes that by adding a .max(0.0) to clip the value to 0 before making a Duration.