aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU #124850
Comments
Assigning to myself to get a quick repro...
Hi-pri: in 2.3.0 it just crashes the MPS runtime.
Now I wonder whether this is a regression or not...
@malfet If it helps, I installed a new env on my machine and tested against 2.1, and got the same results.
There is a change for this:
[MPS] Native nonzero implementation (#125355): Fixes #124850. Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fall back to CPU (the previous implementation was not reliable and was comparable to CPU in speed). Pull Request resolved: #125355. Approved by: https://github.com/kulinseth (cherry picked from commit a40d6df). Co-authored-by: Denis Vieriu <dvieriu@apple.com>
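Until a build with that change is available, a user-side workaround consistent with the fallback described above is to route `nonzero` through the CPU. This is only a sketch; `nonzero_via_cpu` is a hypothetical helper, not part of PyTorch:

```python
import torch

def nonzero_via_cpu(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: compute nonzero on the CPU and move the
    # resulting indices back to the tensor's original device.
    return torch.nonzero(t.cpu()).to(t.device)

if torch.backends.mps.is_available():
    mask = torch.rand(1024, 1024, device="mps") > 0.5
    idx = nonzero_via_cpu(mask)  # avoids the slow/crashing MPS path on affected versions
```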
From what I see, the crashing issue is still there in the upcoming 2.3.1 release on my M1 laptop. Reopening this issue for now and I'll double check with @malfet
Removing from milestone since still not fixed as of 7/1/2024.
Pulled from CPU
MPS
Env
I believe it is. I might have a fix; first I just need to verify correctness. At least the runtime is down to something more reasonable on MPS.
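A minimal correctness spot-check along those lines (an illustrative sketch, not the actual test used here) could compare the MPS result against the CPU reference:

```python
import torch

if torch.backends.mps.is_available():
    x_cpu = torch.randint(0, 2, (4096, 4096), dtype=torch.bool)
    x_mps = x_cpu.to("mps")
    # Assumes the MPS implementation returns indices in the same
    # row-major order as the CPU reference.
    assert torch.equal(torch.nonzero(x_mps).cpu(), torch.nonzero(x_cpu))
```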
🐛 Describe the bug
I found that running a torchvision model under the MPS backend was extremely slow compared to CPU.
I ran the profiler and found that the vast majority of that time was coming from a small number of calls to aten::nonzero.
Using the repro below with the cpu device takes ~1s to run, but switching to mps increases this to ~75s, most of which is spent in aten::nonzero. I wonder if this might be related to #122916.
Repro:
CPU profile results:
MPS profile results:
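For reference, a minimal standalone sketch of this kind of measurement (an illustration using the same profiler API, not the torchvision repro itself) looks roughly like:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def time_nonzero(device: str) -> None:
    # Boolean mask with roughly half the elements set, on the target device.
    x = torch.rand(2048, 2048, device=device) > 0.5
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(50):
            torch.nonzero(x)
    print(device)
    # On the slow backend, aten::nonzero dominates self CPU time.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))

time_nonzero("cpu")
if torch.backends.mps.is_available():
    time_nonzero("mps")
```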
P.S. I think running the mps repro above might have hard-crashed my laptop (it happened whilst I was writing this issue for the first time), but I don't have access to another machine to test that this isn't an issue with my machine.

Versions
Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 12:57:28) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M3 Pro
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.2.2 py3.12_0 pytorch
[conda] torchvision 0.17.2 py312_cpu pytorch
cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen