Added SVE intrinsics for postGemmPart function #3271
Conversation
@shubhamsvc I haven't checked all the places where this is used, but I think the 'exp' function as used in RBF kernels requires high-precision computations - otherwise the quality of the results might degrade significantly in some situations (particularly for Newton-type methods that rely on those computations). The exp function from MKL that gets called on x86 is invoked in 'high accuracy' mode, as you can see here:
Could you provide some information about the accuracy level of the 'exp' function here? Is there some reference paper analyzing the method? @Vika-F Could you provide some info about which algorithms use this function?
@david-cortes-intel @shubhamsvc
Besides RBF, this function is used in GBT at the prediction stage, but I guess the accuracy requirements are lower there. vExp is also used in the EM GMM, AdaBoost and LogitBoost algorithms, but those are not part of sklearn-intelex, so the importance of those algorithms is lower.
But to clarify: the PR is not modifying the 'vExp' function in the '_ref' service file, it's just modifying it in this particular file (for the RBF kernel). I understand RBF is used in SVMs (not sure if the algorithm there degrades with lower-precision exp), and might be used in the future in spectral clustering (I guess lower exp precision shouldn't be much of an issue there), but is there some other place where these RBF kernels might be called?
@david-cortes-intel But potentially it can be used in any algorithm that can benefit from kernels (we only have SVM for now). Sklearn has kernel ridge regression, for example, and maybe something else. @shubhamsvc
@david-cortes-intel @Vika-F Thank you for the quick response. What ULP accuracy is expected for exp(x) in this case?
For the general vExp function, which is also used in logistic regression, I don't think it'd be advantageous to use something with more than 1 ULP of error even if much faster, as numerical inaccuracies there do have a noticeable effect on convergence speed and quality of results. For the SVM RBF kernel specifically I am not sure - perhaps someone more familiar with the underlying algorithm could comment. Nevertheless, it would still be ideal to know the accuracy level of this 'exp' function, at the very least to leave it as a comment in the code. Perhaps one potential next step could be to conduct tests with RBF SVMs using the sklbench repository (which requires sklearnex built against this oneDAL branch) before and after this PR and see how the quality metrics change. @Alexsandruss Could you provide instructions for running the SVM RBF cases from sklbench? @rakshithgb-fujitsu Could you perhaps try out the changes before/after this PR with SVM RBF kernel examples on ARM hardware?
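One way to answer the accuracy question empirically is a brute-force ULP sweep of the approximation against a higher-precision reference. Below is a minimal sketch of such a check (illustrative only; `exp_under_test` is a hypothetical stand-in for whatever scalar or vector implementation is being evaluated, not anything from this PR):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in: replace with a scalar wrapper around the exp implementation under test.
static float exp_under_test(float x)
{
    return std::exp(x);
}

// exp() results in the swept range are positive and finite, so their IEEE-754 bit
// patterns are monotonically ordered and the ULP distance is just the absolute
// difference of those bit patterns.
static std::uint32_t ulp_distance(float approx, float reference)
{
    std::uint32_t ia, ib;
    std::memcpy(&ia, &approx, sizeof(ia));
    std::memcpy(&ib, &reference, sizeof(ib));
    return ia > ib ? ia - ib : ib - ia;
}

int main()
{
    std::uint32_t worst = 0;
    // RBF kernel arguments to exp are non-positive, so sweep a representative negative range.
    for (double x = -30.0; x <= 0.0; x += 1e-4)
    {
        const float xf        = static_cast<float>(x);
        const float reference = static_cast<float>(std::exp(static_cast<double>(xf))); // double-precision reference
        const std::uint32_t u = ulp_distance(exp_under_test(xf), reference);
        if (u > worst) worst = u;
    }
    std::printf("max observed ULP error: %u\n", worst);
    return 0;
}
```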
To support the importance of RBF: it is already exposed in sklearnex in the onedal module, https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/onedal/primitives/kernel_functions.py#L90 and could easily become a publicly-usable sklearnex function in short order (replicating sklearn functionality: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html)
I added the PR checklist to guide reviewing/development. |
Pull Request Overview
This pull request adds SVE-based implementations for the postGemmPart function targeting the RBF kernel, accelerating vectorized computations on ARM (Graviton3) for both float and double data types.
- Adds conditional compilation and header inclusion for ARM SVE support.
- Implements an inline exponential function (exp_ps_sve) for float32_t using SVE intrinsics.
- Introduces unrolled loop implementations in the postGemmPart functions for both float and double types using SVE intrinsics.
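For readers less familiar with the pattern, here is a minimal sketch of the kind of predicated SVE loop described above. It is illustrative only: the single `coeff` multiplier, the in-place buffer layout, and the `exp_ps_sve` signature are assumptions made for the example, not the PR's actual code.

```cpp
#include <arm_sve.h>
#include <cstddef>

// Assumed vector exp helper, analogous to the exp_ps_sve discussed in this PR.
svfloat32_t exp_ps_sve(svbool_t pg, svfloat32_t x);

// Applies out[i] = exp(coeff * out[i]) over a float buffer, letting the
// svwhilelt predicate mask the final partial vector.
static void rbf_post_sve(float * out, std::size_t n, float coeff)
{
    const svfloat32_t vcoeff = svdup_n_f32(coeff);
    for (std::size_t i = 0; i < n; i += svcntw())
    {
        const svbool_t pg = svwhilelt_b32_u64(i, n); // active lanes for this iteration
        svfloat32_t v     = svld1_f32(pg, out + i);  // load partial GEMM results
        v                 = svmul_f32_z(pg, v, vcoeff);
        v                 = exp_ps_sve(pg, v);       // vectorized exponential
        svst1_f32(pg, out + i, v);                   // inactive tail lanes are left untouched
    }
}
```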
Comments suppressed due to low confidence (1)
cpp/daal/src/algorithms/kernel_function/kernel_function_rbf_helper.h:239
- [nitpick] The variable name 'exp' may conflict with standard library function names; consider renaming it to 'exp_result' or a similar descriptive identifier.
svfloat32_t exp = exp_ps_sve(pg, tmp);
// Algorithm starts here
svfloat32_t t0 = svmul_f32_z(pg, src, log2_e); // y = x * log2(e)
svfloat32_t t1 = svrintm_f32_z(pg, t0); // rount to int (float)
The comment contains a typo: 'rount' should be corrected to 'round'.
svfloat32_t t1 = svrintm_f32_z(pg, t0); // rount to int (float)
svfloat32_t t1 = svrintm_f32_z(pg, t0); // round to int (float)
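For context on where those first two steps lead, below is a minimal range-reduction exp sketch in the same style. It is illustrative only and is not the implementation from this PR (which was later removed): the coefficients are plain Taylor terms rather than minimax ones, and there is no special handling of overflow, underflow, or NaN, so its ULP accuracy would be modest.

```cpp
#include <arm_sve.h>

// e^x = 2^(x*log2(e)): split y = x*log2(e) into n = floor(y) and f = y - n,
// approximate 2^f = e^(f*ln(2)) with a short polynomial, then scale by 2^n.
inline svfloat32_t exp_sketch_f32(svbool_t pg, svfloat32_t x)
{
    const svfloat32_t log2e = svdup_n_f32(1.4426950408889634f);
    const svfloat32_t ln2   = svdup_n_f32(0.6931471805599453f);

    svfloat32_t y = svmul_f32_z(pg, x, log2e); // y = x * log2(e)
    svfloat32_t n = svrintm_f32_z(pg, y);      // n = floor(y)
    svfloat32_t f = svsub_f32_z(pg, y, n);     // f in [0, 1)
    svfloat32_t r = svmul_f32_z(pg, f, ln2);   // 2^f = e^r with r = f * ln(2)

    // degree-4 Taylor polynomial for e^r, evaluated by Horner's rule
    svfloat32_t p = svdup_n_f32(1.0f / 24.0f);
    p = svmla_f32_z(pg, svdup_n_f32(1.0f / 6.0f), p, r); // 1/6 + p*r
    p = svmla_f32_z(pg, svdup_n_f32(0.5f), p, r);        // 1/2 + p*r
    p = svmla_f32_z(pg, svdup_n_f32(1.0f), p, r);        // 1   + p*r
    p = svmla_f32_z(pg, svdup_n_f32(1.0f), p, r);        // e^r

    // multiply by 2^n via FSCALE (n is exactly integral here, so the conversion is exact)
    return svscale_f32_z(pg, p, svcvt_s32_f32_z(pg, n));
}
```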
#if (__CPUID__(DAAL_CPU) == __sve__)

// exp function for float32_t using SVE intrinsics
inline svfloat32_t exp_ps_sve(svbool_t& pg, svfloat32_t& src) {
[nitpick] If the parameters in exp_ps_sve are not intended to be modified within the function, consider declaring them as const (e.g., 'const svbool_t&' and 'const svfloat32_t&') to make the intent explicit.
inline svfloat32_t exp_ps_sve(svbool_t& pg, svfloat32_t& src) {
inline svfloat32_t exp_ps_sve(const svbool_t& pg, const svfloat32_t& src) { |
Yes, Shubham is from Fujitsu as well. We'll share all the benchmark details and validate it on ARM hardware.
@david-cortes-intel The current SVE implementation of exp had low accuracy, so it has been removed in this PR. I am working on improving the ULP accuracy of exp and will include an updated version in a subsequent PR.
CI failures do not look related to this PR, but @shubhamsvc please lint the files according to the instructions and merge the main branch here.
* Added SVE intrinsics for postGemm function
* Removed SVE implementation of float exponential due to accuracy issues
* Format code using clang-format

Co-authored-by: shubham.chaudhari <Shubham.Chaudhari@fujitsu.com>
This pull request adds SVE-based implementations of the postGemmPart function for both float and double types to accelerate vectorized computation on ARM.
Average Performance (on Graviton3)
Float: ~1.26× speedup over scalar
Double: ~1.19× speedup over scalar
A PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are addressed.
This approach ensures that reviewers don't spend extra time asking for standard requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a documentation update doesn't require performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance