Update on "[quant][core][performance] Changed cudnn quantized conv2d … …impl to use inplace operations" Summary: This PR changed the implementation for the conv2d cudnn operator to use inplace ops. This increases the quantized conv operator's efficiency when bias and/or relu is used. Based on discussions, to support inplace operations, unique uids need to be assigned to the input and output even if it is stored at the same memory address. e.g., see the different uids in the current implementation assigned to conv_output.data_ptr Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. [ghstack-poisoned]
Update on "[quant][core][performance] Removed int_repr calls in quant… …ized conv2d cudnn implementation" Summary: This PR removes the int_repr() calls for the activation and weight tensors. Rather than using int8 tensor, we use the qint8 tensor directly as, fundamentaly, the two tensors are equivalent except qint8 tensor has a qconfig. This avoids a copy of the qint8 tensor and significantly increases efficiency. Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. Previous int8 benchmark: int8 benchmark result: ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ quantized::conv2d 99.37% 2.408s 99.44% 2.410s 120.500ms 0.000us 0.00% 6.142ms 307.100us 20 cudaDeviceSynchronize 0.48% 11.747ms 0.48% 11.747ms 11.747ms 0.000us 0.00% 0.000us 0.000us 1 ProfilerStep* 0.07% 1.731ms 99.51% 2.412s 120.587ms 0.000us 0.00% 6.142ms 307.100us 20 aten::empty 0.02% 501.000us 0.02% 501.000us 3.579us 0.000us 0.00% 0.000us 0.000us 140 cudaLaunchKernel 0.02% 452.000us 0.02% 452.000us 7.533us 0.000us 0.00% 0.000us 0.000us 60 aten::int_repr 0.01% 351.000us 0.04% 886.000us 22.150us 2.700ms 12.93% 2.700ms 67.500us 40 aten::_empty_affine_quantized 0.01% 172.000us 0.01% 172.000us 8.600us 0.000us 0.00% 0.000us 0.000us 20 aten::fill_ 0.01% 139.000us 0.01% 254.000us 12.700us 3.442ms 16.49% 3.442ms 172.100us 20 aten::q_scale 0.00% 62.000us 0.00% 62.000us 1.550us 0.000us 0.00% 0.000us 0.000us 40 aten::zeros 0.00% 61.000us 0.00% 112.000us 5.600us 0.000us 0.00% 0.000us 0.000us 20 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 2.424s Self CUDA time total: 20.877ms ``` Current int8 benchmark: ``` int8 benchmark result: ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ cudaDeviceSynchronize 83.02% 15.241ms 83.02% 15.241ms 15.241ms 0.000us 0.00% 0.000us 0.000us 1 ProfilerStep* 7.54% 1.384ms 16.48% 3.026ms 151.300us 0.000us 0.00% 3.460ms 173.000us 20 quantized::conv2d 4.47% 821.000us 8.89% 1.632ms 81.600us 0.000us 0.00% 3.460ms 173.000us 20 aten::empty 1.43% 262.000us 1.43% 262.000us 2.620us 0.000us 0.00% 0.000us 0.000us 100 cudaLaunchKernel 1.05% 193.000us 1.05% 193.000us 9.650us 0.000us 0.00% 0.000us 0.000us 20 aten::fill_ 0.89% 164.000us 1.94% 357.000us 17.850us 3.460ms 19.64% 3.460ms 173.000us 20 aten::_empty_affine_quantized 0.86% 157.000us 0.86% 157.000us 7.850us 0.000us 0.00% 0.000us 0.000us 20 aten::q_scale 
0.32% 59.000us 0.32% 59.000us 1.475us 0.000us 0.00% 0.000us 0.000us 40 aten::zeros 0.29% 53.000us 0.50% 92.000us 4.600us 0.000us 0.00% 0.000us 0.000us 20 cudaEventRecord 0.11% 20.000us 0.11% 20.000us 1.000us 0.000us 0.00% 0.000us 0.000us 20 Self CPU time total: 18.116ms Self CUDA time total: 17.612ms ``` [ghstack-poisoned]
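As a small illustration of the copy being avoided, the following sketch uses the public quantization API (not the operator's internal code path) to show that int_repr() materializes a new int8 tensor:

```python
import torch

x = torch.randn(1, 3, 8, 8)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

# int_repr() returns a fresh plain-int8 tensor -- an extra copy.
x_int8 = qx.int_repr()
assert x_int8.dtype == torch.int8
assert x_int8.data_ptr() != qx.data_ptr()

# The qint8 tensor already holds the same int8 values plus its
# quantization parameters, so a kernel can consume it directly.
print(qx.q_scale(), qx.q_zero_point())
```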
Update on "[Quant][core][refactorization] Refactored qconv_unpack.cpp… … into an implementation file and higher level call registration and definition file" Summary: This refactorization was necessary with the introduction of packed parameters for cudnn. Specifically, the unpack function for the 3 backends: fbgemm, qnnpack, and cudnn, is called using dynamic polymorphism, which was previously done in the file (formerly) /quantized/cpu/qconv_unpack.cpp. This part of the file was moved to the parent directory /quantized/ as it is relevant for both CPU (fbgemm & qnnpack) & CUDA (cudnn in this case). The remaining content is implementation specific to CPU, and the file was subsequently renamed to qconv_unpack_impl.cpp Differential Revision: [D34641680](https://our.internmc.facebook.com/intern/diff/D34641680) [ghstack-poisoned]
Update on "[Quant][core][gpu][improvement] Refactored implementation … …for conv2d_cudnn to use packed parameters" Summary: The previous implementation introduced in pytorch#70622 and expanded on in pytorch#72770, pytorch#73035, pytorch#73337 did not make use of packed parameters. This PR refactors the existing implementation to use packed parameters for cudnn conv2d in the same manner as was done for qnnpack and fbgemm in the following files: aten/src/ATen/native/quantized/cpu/fbgemm_utils.h. aten/src/ATen/native/quantized/cpu/qnnpack_utils.h. aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp. aten/src/ATen/native/quantized/cpu/qconv_unpack.cpp (note this file will be refactored into two files (one located in /quantized/ and the other in /quantized/cpu/) in a subsequent PR, as we are currently using the dispatch introduced in this file for the cudnn operator as well) This allows for all cudnn operators to be registered as quantized::conv2d, quantized::conv2d_relu, quantized::conv2d_prepack, and to allow the dispatcher to determine which backend to use (e.g., cuda/cudnn, fbgemm, or qnnpack). Test cases were also modified to adhere to the methodology of using prepacking the weight & bias prior to passing it into the conv2d operator. We also ensured that the refactorization did not result in a reduction in speed by verifying that the computation times in the benchmark test case (see test plan below) are consistent with the results pre-refactorization. Note the following: apply_impl is now what was formerly raw_cudnn_convolution_forward apply_impl_helper is now what was formerly raw_cudnn_convolution_forward_out Test plan: In pytorch main directory, execute ``` python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn ``` for accuracy testing and ``` python test/test_quantization.py TestQuantizedConv.test_benchmark ``` for benchmark testing. Differential Revision: [D34803275](https://our.internmc.facebook.com/intern/diff/D34803275) [ghstack-poisoned]
[Kineto] Manual Submodule Update (pytorch#73858)

Summary: Pull Request resolved: pytorch#73858

The newer version of Kineto has changes to handle generic activities (such as RocTracer generic activities), so we can remove the older USE_KINETO_UPDATED macro and the implementation of flow.linkedActivity. This patch should bring Kineto back in sync on PyTorch CI.

Test Plan: PyTorch OSS CI needs to pass for this submodule update of the third_party/kineto repo, with ciflow/cuda enabled.

Reviewed By: chaekit

Differential Revision: D34689078

Pulled By: aaronenyeshi

fbshipit-source-id: 4588ead174ab23ecd95facc3a50702b069d423c3
Merge remote-tracking branch 'upstream/master' into remove-qr
Merge remote-tracking branch 'upstream/master' into remove-symeig
Merge remote-tracking branch 'upstream/master' into remove-solve
Merge remote-tracking branch 'upstream/master' into remove-eig
Merge remote-tracking branch 'upstream/master' into remove-matrix_rank