aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU · Issue #124850 · pytorch/pytorch · GitHub

aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU #124850


Open
theo-costain-arondite opened this issue Apr 24, 2024 · 9 comments
Assignees
Labels
high priority module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: mps Related to Apple Metal Performance Shaders framework module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@theo-costain-arondite
theo-costain-arondite commented Apr 24, 2024

🐛 Describe the bug

I found that running a torchvision model under the MPS backend was extremely slow compared to CPU.
I ran the profiler and found that the vast majority of that time came from a small number of calls to aten::nonzero.
Using the repro below, the cpu device takes ~1 s to run, but switching to mps increases this to ~75 s, most of which is spent in aten::nonzero.
I wonder if this might be related to #122916.
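The effect can be isolated without torchvision. The sketch below (my addition, not part of the original report) times torch.nonzero alone on a 1000-element boolean tensor, the same input shape the profiler tables show; absolute timings will vary by machine:

```python
import time

import torch

# Isolate the slow op: time torch.nonzero on CPU and, if available, MPS.
mask = torch.rand(1000) > 0.5

for device in ["cpu", "mps"]:
    if device == "mps" and not torch.backends.mps.is_available():
        continue
    x = mask.to(device)
    torch.nonzero(x)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(10):
        torch.nonzero(x)
    print(f"{device}: {time.perf_counter() - start:.4f}s for 10 calls")
```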

repro

import torch
import torchvision.models as models
from torch.profiler import ProfilerActivity, profile, record_function

device = "mps"  # switch to "cpu" for comparison

inputs = torch.randn(3, 224, 224).to(device)

transform = models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT.transforms()

detection_model = models.detection.fasterrcnn_resnet50_fpn_v2(
    weights=models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT,
).to(device)
detection_model.eval()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        out = detection_model([transform(inputs)])

print(
    prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=10
    )
)
prof.export_chrome_trace(f"trace_test_{device}.json")
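A plausible reason aten::nonzero is so costly on GPU-style backends is that its output shape depends on the tensor's values, so the host must synchronize with the device before the result can even be allocated. Where the surrounding code allows it, a shape-static formulation sidesteps that sync. This is a general workaround sketch of mine, unrelated to the fix that eventually landed:

```python
import torch

scores = torch.rand(1000)
mask = scores > 0.5

# Shape-dependent: boolean indexing dispatches through aten::nonzero,
# because the number of selected elements is only known at runtime.
kept = scores[mask]

# Shape-static alternative: keep the full tensor and zero out rejected
# entries. The output shape is fixed, so no host/device sync is needed.
kept_static = torch.where(mask, scores, torch.zeros_like(scores))
```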

CPU profile results

-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                      model_inference         2.25%      26.482ms       100.00%        1.176s        1.176s             1                                                                                []  
                         aten::conv2d         0.00%      14.000us        38.19%     449.105ms     112.276ms             4                         [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
                    aten::convolution         0.00%      30.000us        38.19%     449.091ms     112.273ms             4                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      33.000us        38.19%     449.061ms     112.265ms             4  [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []  
    aten::_nnpack_spatial_convolution        38.18%     448.984ms        38.19%     449.028ms     112.257ms             4                                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], []]  
                         aten::conv2d         0.00%       6.000us         9.21%     108.300ms      54.150ms             2                     [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], []]  
                    aten::convolution         0.00%      15.000us         9.21%     108.294ms      54.147ms             2             [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      13.000us         9.21%     108.279ms      54.139ms             2  [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], [], [], [], []  
                    aten::thnn_conv2d         0.00%       4.000us         9.21%     108.266ms      54.133ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
           aten::_slow_conv2d_forward         8.96%     105.342ms         9.21%     108.262ms      54.131ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.176s

MPS profile results

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.00%   -2260.000us       100.00%       74.706s       74.706s             1                                                                                []  
                     aten::where        -0.00%     -17.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      38.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                   aten::nonzero        99.24%       74.138s        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                       aten::cat         0.06%      41.956ms         0.06%      41.956ms       1.998ms            21                                                                          [[], []]  
                       aten::sub         0.05%      34.111ms         0.05%      34.111ms      34.111ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                        aten::to        -0.00%     -88.000us         0.04%      32.683ms       1.421ms            23                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.00%     676.000us         0.04%      32.681ms       2.723ms            12                                                      [[], [], [], [], [], [], []]  
                     aten::copy_         0.04%      32.618ms         0.04%      32.645ms       2.720ms            12                                                                      [[], [], []]  
                     aten::stack         0.00%      64.000us         0.03%      23.270ms       2.327ms            10                                                                          [[], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.706s

P.S. I think running the mps repro above might have hard-crashed my laptop (it happened while I was writing this issue for the first time), but I don't have access to another machine to verify that this isn't an issue with my machine.

Versions

Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 12:57:28) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.2.2 py3.12_0 pytorch
[conda] torchvision 0.17.2 py312_cpu pytorch

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen

@malfet malfet added module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: mps Related to Apple Metal Performance Shaders framework labels Apr 24, 2024
@malfet malfet self-assigned this Apr 24, 2024
@malfet
Contributor
malfet commented Apr 24, 2024

Assigning to myself to get a quick repro...

@malfet malfet added module: crash Problem manifests as a hard crash, as opposed to a RuntimeError high priority and removed triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Apr 24, 2024
@malfet
Contributor
malfet commented Apr 24, 2024

Marking as hi-pri since in 2.3.0 it just crashes the MPS runtime:

% python bug-124850.py
STAGE:2024-04-24 11:27:36 16333:8699272 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x16c6a8a80>
    label = <none> 
    device = <AGXG14SDevice: 0x162850a00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x167220000>
        label = <none> 
        device = <AGXG14SDevice: 0x162850a00>
            name = Apple M2 Pro 
    retainedReferences = 1
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference         0.24%       3.440ms       100.00%        1.437s        1.437s             1                                                                                []  
                     aten::where         0.00%       4.000us        51.07%     734.106ms     183.526ms             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      32.000us        51.07%     734.102ms     183.525ms             4                                                                          [[1000]]  
                   aten::nonzero        50.60%     727.331ms        51.06%     734.028ms     183.507ms             4                                                                          [[1000]]  
                       aten::cat         3.51%      50.495ms         3.51%      50.495ms       2.405ms            21                                                                          [[], []]  
                     aten::clamp         3.50%      50.339ms         3.50%      50.339ms      50.339ms             1                                                                  [[1000], [], []]  
                       aten::sub         3.45%      49.571ms         3.45%      49.571ms      49.571ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                     aten::stack         0.00%      62.000us         1.64%      23.551ms       2.355ms            10                                                                          [[], []]  
                        aten::to        -0.01%    -148.000us         1.39%      19.996ms     952.190us            21                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.01%     179.000us         1.39%      19.988ms       1.999ms            10                                                      [[], [], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.437s

Now I wonder whether this is a regression or not...

@malfet malfet added this to the 2.3.1 milestone Apr 24, 2024
@theo-costain-arondite
Author

@malfet If it helps, I installed a new env on my machine and tested against 2.1, and got the same results.

PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.0
[pip3] torchvision==0.16.0
[conda] numpy                     1.26.4          py311he598dae_0  
[conda] numpy-base                1.26.4          py311hfbfe69c_0  
[conda] pytorch                   2.1.0                  py3.11_0    pytorch
[conda] torchvision               0.16.0                py311_cpu    pytorch
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.02%  -12725.000us       100.00%       74.826s       74.826s             1                                                                                []  
                     aten::where         0.00%       2.925ms        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      20.000us        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                   aten::nonzero        99.06%       74.121s        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                torchvision::nms         0.12%      86.786ms         0.13%      97.168ms      97.168ms             1                                                           [[4507, 4], [4507], []]  
                       aten::cat         0.05%      39.946ms         0.05%      39.946ms       1.902ms            21                                                                          [[], []]  
                     aten::index         0.04%      31.605ms         0.05%      37.235ms      18.617ms             2                                                                 [[1, 159882], []]  
                       aten::sub         0.05%      36.937ms         0.05%      36.937ms      36.937ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                  aten::_to_copy         0.00%     627.000us         0.04%      30.244ms       1.680ms            18                                                      [[], [], [], [], [], [], []]  
                        aten::to         0.00%      93.000us         0.04%      30.234ms       1.315ms            23                                                          [[], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.826s

@cpuhrsch cpuhrsch added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Apr 29, 2024
@kulinseth
Collaborator

nonzero is indeed written in an inefficient way...

@kulinseth
Collaborator

There is a change for nonzero which should help:
main...skotapati:pytorch-sidk:dev/skotapati/nonzero_mps

pytorchbot pushed a commit that referenced this issue May 13, 2024
Fixes #124850

Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fallback to CPU (previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: #125355
Approved by: https://github.com/kulinseth

(cherry picked from commit a40d6df)
huydhn pushed a commit that referenced this issue May 13, 2024
[MPS] Native nonzero implementation (#125355)

Fixes #124850

Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fallback to CPU (previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: #125355
Approved by: https://github.com/kulinseth

(cherry picked from commit a40d6df)

Co-authored-by: Denis Vieriu <dvieriu@apple.com>
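The fallback strategy the PR describes can be sketched at the Python level. The helper below is hypothetical and for illustration only (the actual change lives in the C++ MPS backend, not in a Python API): when a backend's kernel is unreliable or slow, route the op through CPU and copy the result back.

```python
import torch

def nonzero_with_cpu_fallback(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper mirroring the fallback idea in #125355:
    run nonzero on CPU for MPS tensors, then move the result back."""
    if t.device.type == "mps":
        return torch.nonzero(t.cpu()).to(t.device)
    return torch.nonzero(t)
```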
@huydhn
Contributor
huydhn commented May 30, 2024

From what I see, the crashing issue is still there in the upcoming 2.3.1 release on my M1 laptop. Reopening this issue for now and I'll double check with @malfet

STAGE:2024-05-29 18:51:13 53295:3191016 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error:
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x36ae0d100>
    label = <none>
    device = <AGXG13XDevice: 0x1434d4a00>
        name = Apple M1 Pro
    commandQueue = <AGXG13XFamilyCommandQueue: 0x16d02d800>
        label = <none>
        device = <AGXG13XDevice: 0x1434d4a00>
            name = Apple M1 Pro
    retainedReferences = 1
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                 model_inference        -1.77%  -46315.000us       100.00%        2.612s        2.612s             1                                                                                []
                     aten::where         2.16%      56.389ms        66.03%        1.725s     431.137ms             4                                                                          [[1000]]
             aten::nonzero_numpy         0.00%      41.000us        66.03%        1.725s     431.136ms             4                                                                          [[1000]]
                   aten::nonzero        65.76%        1.718s        66.02%        1.724s     431.114ms             4                                                                          [[1000]]
                       aten::cat         2.96%      77.404ms         2.96%      77.404ms       3.686ms            21                                                                          [[], []]
                       aten::sub         2.24%      58.528ms         2.24%      58.528ms      58.528ms             1                                                    [[3, 224, 224], [3, 1, 1], []]
                     aten::stack         0.00%     103.000us         1.25%      32.740ms       3.274ms            10                                                                          [[], []]
                    aten::arange         1.00%      26.230ms         1.01%      26.254ms       2.188ms            12                                                                 [[], [], [], [0]]
                     aten::where         0.00%       6.000us         0.94%      24.461ms      12.230ms             2                                                                          [[4507]]
             aten::nonzero_numpy         0.00%      13.000us         0.94%      24.455ms      12.227ms             2                                                                          [[4507]]
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
Self CPU time total: 2.612s

@huydhn huydhn reopened this May 30, 2024
@huydhn huydhn modified the milestones: 2.3.1, 2.4.0 May 31, 2024
@atalman atalman removed this from the 2.4.0 milestone Jul 2, 2024
@atalman
Contributor
atalman commented Jul 2, 2024

Removing from the milestone since this is still not fixed as of 7/1/2024.

@hvaara
Contributor
hvaara commented Aug 8, 2024

Pulled from main for torch and torchvision and ran the repro steps.

CPU

-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                      model_inference         1.72%      15.736ms       100.00%     914.620ms     914.620ms             1                                                                                []  
                         aten::conv2d         0.00%      19.251us        41.20%     376.793ms      94.198ms             4                         [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
                    aten::convolution         0.01%      47.873us        41.19%     376.774ms      94.194ms             4                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], []]  
                   aten::_convolution         0.01%      47.667us        41.19%     376.726ms      94.182ms             4  [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []  
    aten::_nnpack_spatial_convolution        41.18%     376.612ms        41.18%     376.677ms      94.169ms             4                                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], []]  
               torchvision::roi_align        14.45%     132.120ms        15.05%     137.659ms     137.659ms             1                                [[1, 256, 200, 200], [964, 5], [], [], [], [], []]  
                         aten::conv2d         0.00%       5.624us         6.63%      60.599ms      30.299ms             2                     [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], []]  
                    aten::convolution         0.00%      15.749us         6.62%      60.593ms      30.297ms             2             [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      15.668us         6.62%      60.577ms      30.289ms             2  [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], [], [], [], []  
                    aten::thnn_conv2d         0.00%       5.333us         6.62%      60.561ms      30.281ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 914.620ms

MPS

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference         0.02%       6.102ms       100.00%       34.789s       34.789s             1                                                                                []  
                     aten::where         0.00%       4.418us        98.23%       34.174s        8.544s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      12.873us        98.23%       34.174s        8.544s             4                                                                          [[1000]]  
                   aten::nonzero        98.22%       34.170s        98.23%       34.174s        8.544s             4                                                                          [[1000]]  
                       aten::cat         0.12%      40.344ms         0.12%      40.344ms       1.921ms            21                                                                          [[], []]  
                     aten::clamp         0.11%      36.795ms         0.11%      36.796ms      18.398ms             2                                                              [[1000, 91], [], []]  
                       aten::sub         0.09%      32.579ms         0.09%      32.579ms      32.579ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                     aten::clamp         0.06%      22.473ms         0.06%      22.485ms      11.242ms             2                                                           [[1000, 91, 2], [], []]  
                     aten::stack         0.00%      45.751us         0.06%      21.152ms       2.115ms            10                                                                          [[], []]  
                     aten::where         0.00%       8.002us         0.05%      15.979ms       7.990ms             2                                                                          [[4507]]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 34.789s

Env

Collecting environment information...
PyTorch version: 2.5.0a0+git7b2664e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.6 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.30.1
Libc version: N/A

Python version: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.6-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Max

Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.15.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] optree==0.12.1
[pip3] torch==2.5.0a0+git9d5c85c
[pip3] torchvision==0.20.0a0+0d80848
[conda] numpy                     1.26.0                   pypi_0    pypi
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] torch                     2.5.0a0+git9d5c85c           dev_0    <develop>
[conda] torchfix                  0.4.0                    pypi_0    pypi
[conda] torchvision               0.20.0a0+0d80848           dev_0    <develop>

@hvaara
Contributor
hvaara commented Aug 9, 2024

Now I wonder whether this is a regression or not...

I believe it is.

I might have a fix - first I just need to verify correctness. At least the runtime is down to something more reasonable.
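One way to sanity-check such a fix is to compare MPS output against the CPU reference across a few input shapes. This is my own verification sketch, not the test that shipped with the fix:

```python
import torch

def check_nonzero_matches_cpu(shape) -> None:
    # nonzero on MPS should match the CPU reference exactly, including
    # the row-major ordering of the returned indices.
    x = torch.rand(shape) > 0.5
    ref = torch.nonzero(x)
    if torch.backends.mps.is_available():
        got = torch.nonzero(x.to("mps")).cpu()
        assert torch.equal(ref, got), f"mismatch for shape {shape}"

# Shapes taken from the profiler tables in this thread.
for shape in [(1000,), (4507,), (16, 16)]:
    check_nonzero_matches_cpu(shape)
```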

MPS

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference         0.65%       9.151ms       100.00%        1.410s        1.410s          32 b        -148 b             1                                                                                []  
                    aten::linear        17.70%     249.478ms        17.70%     249.479ms     249.479ms           0 b           0 b             1                                               [[1, 12544], [1024, 12544], [1024]]  
                    aten::linear        17.42%     245.553ms        17.42%     245.558ms     245.558ms           0 b           0 b             1                                                     [[1, 1024], [91, 1024], [91]]  
                     aten::where         0.00%      26.791us         7.27%     102.506ms      51.253ms           0 b           0 b             2                                                                          [[4507]]  
             aten::nonzero_numpy         0.00%      12.583us         7.27%     102.479ms      51.239ms           0 b           0 b             2                                                                          [[4507]]  
                   aten::nonzero         7.20%     101.422ms         7.27%     102.450ms      51.225ms           0 b           0 b             2                                                                          [[4507]]  
                    aten::conv2d         0.00%       9.002us         5.59%      78.751ms      19.688ms           0 b           0 b             4                            [[1, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
               aten::convolution         0.00%      22.248us         5.59%      78.742ms      19.686ms           0 b           0 b             4                    [[1, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], []]  
              aten::_convolution         0.00%      19.832us         5.58%      78.720ms      19.680ms           0 b           0 b             4    [[1, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
          aten::_mps_convolution         5.58%      78.692ms         5.58%      78.700ms      19.675ms           0 b           0 b             4                            [[1, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.410s
