MPS device can train or evaluate models producing unacceptable output due to "fast math" optimization #84936
Comments
1e-7 is expected due to slightly different fp precision |
I thought they were both fp32? |
To better illustrate the problem that motivated me to post this issue, I can present some output from CURL. The first image is from the CPU: [CPU output image] This is from the mps device: [mps output image] The difference in output comes down to a cascade of tiny differences in fp values that avalanche into large differences. I think these calculations need to be exactly the same on the cpu and mps device to ensure correctness. Otherwise, I don't see how the mps device can be relied on. |
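A minimal sketch of the kind of compounding described here, assuming an MPS-capable machine; it is not the CURL model, just a chain of repeated ops with the cpu/mps divergence printed along the way.

```python
import torch

# Iterating a mildly chaotic map amplifies any per-op discrepancy between
# backends: a ~1e-7 gap per step can become order one within a few dozen
# steps, while identical backends would stay at exactly zero.
torch.manual_seed(0)
x_cpu = torch.rand(1024)
x_mps = x_cpu.to("mps")

for step in range(1, 31):
    x_cpu = torch.sin(x_cpu) * 3.0
    x_mps = torch.sin(x_mps) * 3.0
    if step % 5 == 0:
        diff = (x_cpu - x_mps.cpu()).abs().max().item()
        print(f"step {step:2d}: max abs cpu/mps difference = {diff:.3e}")
```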
Is a PR to flip these to NO the correct next step here? pytorch/aten/src/ATen/mps/MPSDevice.mm Line 36 in ce7177f |
I've looked into this as well, and I don't think this will make a difference. In fact, IIRC I recompiled PyTorch with these set to `NO` and saw no change in the results. As far as I can tell, PyTorch's MPS device is backed by Apple's MPSGraph framework. What we need is the ability to "recompile" or select a version of MPSGraph that has fast math disabled, assuming my hypothesis is correct that it's enabled for those shaders. Unfortunately, like many of Apple's frameworks, MPSGraph is closed source. Otherwise, I'd be digging in there already. I know there are some Apple engineers that work on this codebase. I believe they will be able to diagnose and address this issue. Hmmm... this makes me wonder if there's a way to inspect whether a shader has been compiled with fast math enabled without having access to its source code... hmmm... |
The attached file is an Xcode project which demonstrates the difference in computed values described above. For any Apple engineers watching this issue, my feedback report number is FB11657311. |
I need to refine my assessment of the problem. The problem is not with IEEE 754 conformance. Rather, I'd suggest the problem is that the implementations of certain transcendental/elementary functions in Metal's "fast math" library produce different results for the same inputs than their standard C math library counterparts. |
@kulinseth This issue was reported for macOS 13. Now that macOS Sonoma (14) beta has been released, can you comment on whether this issue has been (or will be) addressed in that version? I'm unlikely to install the Sonoma beta myself for the time being, but am hopeful we will have a solution in this release. |
@mallman , we have fixed the "fast-math" issue even on macOS Ventura. Can you please try Ventura 13.4 and see if the issue still persists? |
I tried the python script in this issue's description, and I tried the Xcode project I attached (MPSImprecision.zip). Both still show the same difference in calculation between cpu and mps. I'm using Ventura 13.4, Xcode 14.3.1 and Pytorch 2.0.0. So, yes, the issue still persists. Are you seeing different behavior on your end? FWIW, my feedback report (FB11657311) has been marked as "Potential fix identified - For a future OS update". That hasn't changed in a long time (not sure when). |
@kulinseth As I indicated, this issue still exists in macOS 13.4. Can you validate that this issue is fixed (or will be fixed) in Sonoma? It is a significant impediment to using the mps device. |
I encountered the same issue, where there was a significant discrepancy between the final results when running my network on MPS compared to the nvGPU and CPU versions. Upon debugging, I found that there were errors in every step of the inference in the transformer layer, which resulted in very poor final results. This issue is very severe and directly affects the viability of using MPS. |
I agree. There are two things I'm aware of which are holding back MPS from perfect (or close enough) training/prediction fidelity with CPU in Pytorch:
1. The "fast math" implementations of transcendental functions, i.e. the subject of this issue.
2. The lack of 64-bit floating point support.
I would say point 1 is more important, and easier to fix. Point 2 is relevant in some but not all models. In some cases, lower precision is perfectly acceptable and is desirable for savings in execution time and storage. It would be nice to see an implementation of bfloat16 as well. (Actually, it looks like bfloat16 is coming to Sonoma, at least on supported hardware.) I'm very pessimistic about Apple supporting a 64-bit fp type in Metal anytime soon. I believe their current GPU hardware is limited to 32-bit fp precision, and the neural processor is 16-bit.

Incidentally, @KouseiHongqing, what floating point precision are you using to run/train your network on cpu and nvgpu? (I have very little experience with nVidia accelerators, but I believe they support fp64.) It would be interesting to see if you get acceptable results using 32-bit floating point on cpu and nvidia. At least then you can assume you won't need Metal 64-bit floating point support. If you get inadequate results with fp32, then addressing the issue with MPS "fast math" likely won't help your use case.

As you can see from earlier in this thread, an Apple engineer claimed this issue had been fixed in Ventura. Even though my testing showed it hadn't been fixed, I hope this means they are at least taking this issue seriously. Cheers. |
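A quick way to sanity-check the question raised above (whether fp32 alone is adequate), run entirely on the CPU; `model` and `batch` are placeholders standing in for the actual network and input.

```python
import torch

# Run the same forward pass in fp32 and fp64 on the CPU and compare.
# If the gap is already unacceptable, fp32 itself is the problem, and fixing
# MPS "fast math" alone would not help that use case.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 8)
)
batch = torch.randn(16, 64)

with torch.no_grad():
    out32 = model.float()(batch.float())
    out64 = model.double()(batch.double())

print("max |fp64 - fp32|:", (out64 - out32.double()).abs().max().item())
```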
@mallman I am using PyTorch's mixed-precision computing, which includes FP16 and FP32. However, they both have varying degrees of errors, and I don't think it's due to fast math. As an alternative, I am running inference using the CPU on my Mac Studio, which is three times slower than running it with MPS. |
In the latest version of PyTorch (nightly 0918), my precision issue has been resolved. Many thanks for the support. pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu |
I am following this tutorial on diffusion models and, even just adding noise to the images, the results are completely different on mps compared to cpu. |
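A sketch of the kind of check being described, assuming the usual forward-diffusion noising formula and an MPS-capable machine; it is not the tutorial's code, and the image and schedule values are placeholders.

```python
import torch

# Add the same Gaussian noise to the same "image" on cpu and on mps using the
# standard forward-diffusion formula, then compare the two results.
torch.manual_seed(0)
img = torch.rand(3, 64, 64)
noise = torch.randn_like(img)
alpha_bar = torch.tensor(0.5)  # placeholder cumulative noise-schedule value

def add_noise(x, n, ab):
    return ab.sqrt() * x + (1 - ab).sqrt() * n

out_cpu = add_noise(img, noise, alpha_bar)
out_mps = add_noise(img.to("mps"), noise.to("mps"), alpha_bar.to("mps"))
print("max abs cpu/mps difference:", (out_cpu - out_mps.cpu()).abs().max().item())
```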
@atabakd , can you please file a separate issue? |
I'm genuinely mystified why this issue hasn't been resolved. Is the mps codebase plagued by such numerous or complex bugs that this one can't be addressed in a year's time? Is this not so easily fixed? I'm not suggesting someone else fix a problem I could. I've actually put in my due diligence before even filing this issue, looking for a fix in Pytorch. And I've always been happy to contribute fixes to open source software, but this mps integration is with closed source software from an opaque partner organization. As far as anyone outside Apple knows, we are as close or as far to a resolution to this bug as we were a year ago. Am I being unfair in expressing my frustration? |
I think I have a related and very simple example of how MPS can get completely different results from CPU. Hopefully the simplicity of this issue will be clear and helpful.
Notice how the first two np.where and torch.where examples give the same results, but when using the tensor converted to MPS we get different results? If I've not made an obvious mistake, this is a clear example of how MPS completely ruins calculations, because in this case the indices change, and all downstream calculations become meaningless. |
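The snippet this comment refers to was not captured in this copy of the thread. Based on the later comments (the 2**24-element threshold and the signed/unsigned dtype fix), a reconstruction of the kind of example being described might look like the sketch below; the exact sizes and dtype are assumptions.

```python
import numpy as np
import torch

# Reconstruction (not the commenter's original code): a tensor just past the
# 2**24-element mark, with a few nonzero entries at the end.
n = 2**24 + 16
a = np.zeros(n, dtype=np.float32)
a[-4:] = 1.0

t = torch.from_numpy(a)
print(np.where(a > 0)[0])               # NumPy reference indices
print(torch.where(t > 0)[0])            # cpu: matches NumPy
print(torch.where(t.to("mps") > 0)[0])  # mps: reportedly returned different indices on affected versions
```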
My issue turned out to be not with numerical stability (as I initially thought) but with MPS memory. When I didn't immediately move data back to the CPU and instead collected it into a list, the MPS tensors tended to get zeroed out. So if you work with the MPS device, you should do the conversion ASAP: do your calculations on MPS (which is really great and saves a ton of time), then go back to CPU mode. |
This one was fixed a while back, trying to find a PR number, but it had something to do with signed vs unsigned dtypes |
I think there are quite a few different issues being discussed here. To help with tracking, I think that:
So I think we can close this issue? |
This issue isn't related to the document you reference. This is an issue about MPS using imprecise implementations of transcendental functions (so-called "fast math" from the Metal standard library). The title of this issue is misleading. I will fix that.
IMO, we can close this issue for two reasons:
Of course, I have a preference for (1), but the implementation is opaque (closed source), so there's really nothing anyone can do about this unless they have access to the source code at Apple. Alternatively, we can decide/discover a different underlying reason for this problem. That would require someone with access to the source code for MPS, i.e. an Apple employee. |
I think another possible source of confusion around this specific Pytorch issue is that many other Pytorch users added cases of bugs they encountered, some of which were apparently not caused by or related to the "fast math" setting for MPS. For example, this issue is not about "torch.where is incorrect on MPS when there are 2**24 elements in tensor". |
An alternative approach would be to provide explicit Metal implementations for these operators, compiled without fast math. But I'm curious to see a bit more justification and tradeoffs for doing so. Or, it could be an opt-in for the MPS backend (i.e. torch.backends.mps.use_fast_math = False would dispatch to that kernel; this way we don't need to justify a perf drop) |
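Until something like that flag exists, one user-level workaround consistent with this discussion is to route only the ops you don't trust through the CPU. `exp_precise` below is a hypothetical helper, not a PyTorch API, and it assumes an MPS-capable machine; it trades speed for precision on just that op.

```python
import torch

# Evaluate exp on the CPU even when the tensor lives on mps, then move the
# result back. Everything else stays on the mps device.
def exp_precise(x: torch.Tensor) -> torch.Tensor:
    if x.device.type == "mps":
        return torch.exp(x.cpu()).to(x.device)
    return torch.exp(x)

x = torch.rand(1024, device="mps")
y = exp_precise(x)  # computed with the CPU's exp, returned on mps
```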
Justification: The task of this algorithm is image enhancement. The second picture clearly has a severe purplish cast that makes for a qualitatively inferior, if not bizarre, result. What good is "fast" if it "quickly" gets you a bad result? When I use a program for scientific computing (or any computation, really), I want to have confidence that I will get good results. Really, I think the question should be: what is the justification for "fast" but "bad" results? |
Within PyTorch, we don't have any justification for it, and we always want all operations to be as precise as specified by the format standard. We do disable by default any low-precision compute, even at the cost of performance (see the tf32 discussion for the latest major event there). The problem here, as I understand it, is that the MPS library, which is not under our control and has a very long release cycle, is not providing us with the right tools. So our short-term options are to rewrite a new kernel from scratch, have that function error out, or have an imprecise kernel. @kulinseth can you confirm what the long-term plan is here, and whether it is indeed to ensure that all kernels in MPS will be compiled without fast-math (or at least will have a version without fast math that can be used here)? |
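For reference, the tf32 precedent mentioned above is controlled by explicit backend flags on CUDA; setting them to False keeps matmuls and cuDNN convolutions in full fp32 precision at some performance cost (defaults vary between the two flags and across releases).

```python
import torch

# CUDA-side precision knobs referenced in the tf32 discussion above.
torch.backends.cuda.matmul.allow_tf32 = False  # fp32 matmuls, no TF32 shortcut
torch.backends.cudnn.allow_tf32 = False        # same for cuDNN convolutions
```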
Hi @mallman, MPS is a framework in OS which has been updated to use
For perf reasons these were kept in Fast math mode
As @malfet said, if there are use-cases we can add Metal kernels in precise mode. The image is a good example, but we would need a use-case which shows a training convergence error or a problem with inference caused by a numerical issue. I am happy to pre-emptively move exp and tanh to Precise mode. Let me propose a PR for it. |
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor). Fix bug in non-contiguous tensors handling. Fixes #84936
@kulinseth Thank you for addressing this issue. I will report back if the underlying problem is not fixed. Cheers. |
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor) Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU Fix bug in non-contiguous tensors handling Fixes pytorch#84936 Pull Request resolved: pytorch#128421 Approved by: https://github.com/kulinseth ghstack dependencies: pytorch#128373, pytorch#128375
Was this issue solved? I'm running the same PyTorch model on a CUDA device and an MPS device. The results seem different, and I'm thinking this issue might be the underlying cause. |
🐛 Describe the bug
I am getting radically different results running the CURL model on the cpu versus the mps device (on pytorch 1.12.1 and nightly). I stepped through the debugger to find the underlying difference in calculation. In short, it appears that mps is using the "fast math" version of the standard library, leading to unacceptable results for some models. This is not a bug with CURL.
Here's a minimal example that doesn't work the way I expect:
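(The original script was not preserved in this copy of the issue. The sketch below shows the kind of cpu-vs-mps comparison being described, assuming an MPS-capable machine; it is an illustration, not the exact reproduction.)

```python
import torch

# Compare a couple of transcendental ops on cpu vs the mps device and report
# the largest elementwise discrepancy.
x = torch.linspace(-10, 10, steps=10000)

for fn in (torch.exp, torch.tanh):
    cpu_out = fn(x)
    mps_out = fn(x.to("mps")).cpu()
    diff = (cpu_out - mps_out).abs().max().item()
    print(f"{fn.__name__}: max abs cpu/mps difference = {diff:.3e}")
```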
Versions
Here's my pytorch 1.12.1 environment info:
And here's my pytorch nightly env: