Update on "[quant][core][performance] Changed cudnn quantized conv2d … …impl to use inplace operations" Summary: This PR changed the implementation for the conv2d cudnn operator to use inplace ops. This increases the quantized conv operator's efficiency when bias and/or relu is used. Based on discussions, to support inplace operations, unique uids need to be assigned to the input and output even if it is stored at the same memory address. e.g., see the different uids in the current implementation assigned to conv_output.data_ptr Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. [ghstack-poisoned]
Update on "[quant][core][performance] Removed int_repr calls in quant… …ized conv2d cudnn implementation" Summary: This PR removes the int_repr() calls for the activation and weight tensors. Rather than using int8 tensor, we use the qint8 tensor directly as, fundamentaly, the two tensors are equivalent except qint8 tensor has a qconfig. This avoids a copy of the qint8 tensor and significantly increases efficiency. Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. Previous int8 benchmark: int8 benchmark result: ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ quantized::conv2d 99.37% 2.408s 99.44% 2.410s 120.500ms 0.000us 0.00% 6.142ms 307.100us 20 cudaDeviceSynchronize 0.48% 11.747ms 0.48% 11.747ms 11.747ms 0.000us 0.00% 0.000us 0.000us 1 ProfilerStep* 0.07% 1.731ms 99.51% 2.412s 120.587ms 0.000us 0.00% 6.142ms 307.100us 20 aten::empty 0.02% 501.000us 0.02% 501.000us 3.579us 0.000us 0.00% 0.000us 0.000us 140 cudaLaunchKernel 0.02% 452.000us 0.02% 452.000us 7.533us 0.000us 0.00% 0.000us 0.000us 60 aten::int_repr 0.01% 351.000us 0.04% 886.000us 22.150us 2.700ms 12.93% 2.700ms 67.500us 40 aten::_empty_affine_quantized 0.01% 172.000us 0.01% 172.000us 8.600us 0.000us 0.00% 0.000us 0.000us 20 aten::fill_ 0.01% 139.000us 0.01% 254.000us 12.700us 3.442ms 16.49% 3.442ms 172.100us 20 aten::q_scale 0.00% 62.000us 0.00% 62.000us 1.550us 0.000us 0.00% 0.000us 0.000us 40 aten::zeros 0.00% 61.000us 0.00% 112.000us 5.600us 0.000us 0.00% 0.000us 0.000us 20 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 2.424s Self CUDA time total: 20.877ms ``` Current int8 benchmark: ``` int8 benchmark result: ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ cudaDeviceSynchronize 83.02% 15.241ms 83.02% 15.241ms 15.241ms 0.000us 0.00% 0.000us 0.000us 1 ProfilerStep* 7.54% 1.384ms 16.48% 3.026ms 151.300us 0.000us 0.00% 3.460ms 173.000us 20 quantized::conv2d 4.47% 821.000us 8.89% 1.632ms 81.600us 0.000us 0.00% 3.460ms 173.000us 20 aten::empty 1.43% 262.000us 1.43% 262.000us 2.620us 0.000us 0.00% 0.000us 0.000us 100 cudaLaunchKernel 1.05% 193.000us 1.05% 193.000us 9.650us 0.000us 0.00% 0.000us 0.000us 20 aten::fill_ 0.89% 164.000us 1.94% 357.000us 17.850us 3.460ms 19.64% 3.460ms 173.000us 20 aten::_empty_affine_quantized 0.86% 157.000us 0.86% 157.000us 7.850us 0.000us 0.00% 0.000us 0.000us 20 aten::q_scale 
0.32% 59.000us 0.32% 59.000us 1.475us 0.000us 0.00% 0.000us 0.000us 40 aten::zeros 0.29% 53.000us 0.50% 92.000us 4.600us 0.000us 0.00% 0.000us 0.000us 20 cudaEventRecord 0.11% 20.000us 0.11% 20.000us 1.000us 0.000us 0.00% 0.000us 0.000us 20 Self CPU time total: 18.116ms Self CUDA time total: 17.612ms ``` [ghstack-poisoned]
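As a small illustration of the copy being avoided, the following sketch uses the public quantization API (not the operator's internal code path) to show that int_repr() materializes a new int8 tensor:

```python
import torch

x = torch.randn(1, 3, 8, 8)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

# int_repr() returns a fresh plain-int8 tensor -- an extra copy.
x_int8 = qx.int_repr()
assert x_int8.dtype == torch.int8
assert x_int8.data_ptr() != qx.data_ptr()

# The qint8 tensor already holds the same int8 values plus its
# quantization parameters, so a kernel can consume it directly.
print(qx.q_scale(), qx.q_zero_point())
```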
Update on "[Quant][core][refactorization] Refactored qconv_unpack.cpp… … into an implementation file and higher level call registration and definition file" Summary: This refactorization was necessary with the introduction of packed parameters for cudnn. Specifically, the unpack function for the 3 backends: fbgemm, qnnpack, and cudnn, is called using dynamic polymorphism, which was previously done in the file (formerly) /quantized/cpu/qconv_unpack.cpp. This part of the file was moved to the parent directory /quantized/ as it is relevant for both CPU (fbgemm & qnnpack) & CUDA (cudnn in this case). The remaining content is implementation specific to CPU, and the file was subsequently renamed to qconv_unpack_impl.cpp Differential Revision: [D34641680](https://our.internmc.facebook.com/intern/diff/D34641680) [ghstack-poisoned]
Update on "[Quant][core][gpu][improvement] Refactored implementation … …for conv2d_cudnn to use packed parameters" Summary: The previous implementation introduced in pytorch#70622 and expanded on in pytorch#72770, pytorch#73035, pytorch#73337 did not make use of packed parameters. This PR refactors the existing implementation to use packed parameters for cudnn conv2d in the same manner as was done for qnnpack and fbgemm in the following files: aten/src/ATen/native/quantized/cpu/fbgemm_utils.h. aten/src/ATen/native/quantized/cpu/qnnpack_utils.h. aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp. aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp (note this file will be refactored into two files (one located in /quantized/ and the other in /quantized/cpu/) in a subsequent PR, as we are currently using the dispatch introduced in this file for the cudnn operator as well) This allows for all cudnn operators to be registered as quantized::conv2d, quantized::conv2d_relu, quantized::conv2d_prepack, and to allow the dispatcher to determine which backend to use (e.g., cuda/cudnn, fbgemm, or qnnpack). Test cases were also modified to adhere to the methodology of using prepacking the weight & bias prior to passing it into the conv2d operator. We also ensured that the refactorization did not result in a reduction in speed by verifying that the computation times in the benchmark test case (see test plan below) are consistent with the results pre-refactorization. Note the following: apply_impl is now what was formerly raw_cudnn_convolution_forward apply_impl_helper is now what was formerly raw_cudnn_convolution_forward_out Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. Differential Revision: [D34803275](https://our.internmc.facebook.com/intern/diff/D34803275) [ghstack-poisoned]
[Kineto] Manual Submodule Update (pytorch#73858)

Summary: Pull Request resolved: pytorch#73858

The newer version of Kineto has changes to handle generic activities (such as RocTracer generic activities), so we can remove the older USE_KINETO_UPDATED macro and the implementation of flow.linkedActivity. This patch should bring Kineto back in sync on PyTorch CI.

Test Plan: PyTorch OSS CI needs to pass for this submodule update of the third_party/kineto repo, with ciflow/cuda enabled.

Reviewed By: chaekit

Differential Revision: D34689078

Pulled By: aaronenyeshi

fbshipit-source-id: 4588ead174ab23ecd95facc3a50702b069d423c3
Merge remote-tracking branch 'upstream/master' into remove-qr
Merge remote-tracking branch 'upstream/master' into remove-symeig
Merge remote-tracking branch 'upstream/master' into remove-solve
Merge remote-tracking branch 'upstream/master' into remove-eig
Merge remote-tracking branch 'upstream/master' into remove-matrix_rank