Added SVE intrinsics for postGemmPart function #3271
Conversation
@shubhamsvc I haven't checked all the places where this is used, but I think the 'exp' function as used in RBF kernels requires high-precision computations - otherwise the quality of the results might degrade significantly in some situations (particularly for Newton-type methods that rely on those computations). The exp function from MKL that gets called on x86 is invoked in 'high accuracy' mode, as you can see here:
Could you provide some information about the accuracy level of the 'exp' function here? Is there some reference paper analyzing the method? @Vika-F Could you provide some info about which algorithms use this function?
@david-cortes-intel @shubhamsvc
Besides RBF, this function is used in GBT at the prediction stage, but I guess the accuracy requirements are lower there. vExp is also used in the EM GMM, AdaBoost and LogitBoost algorithms, but those are not part of sklearn-intelex, so the importance of those algorithms is lower.
But to clarify: the PR is not modifying the 'vExp' function in the '_ref' service file, it's just modifying it in this particular file (for the RBF kernel). I understand RBF is used in SVMs (not sure if the algorithm there degrades with lower-precision exp), and might be used in the future in spectral clustering (I guess lower exp precision shouldn't be much of an issue there), but is there some other place where these RBF kernels might be called?
@david-cortes-intel But potentially it can be used in any algorithm that can benefit from kernels (we only have SVM for now). Sklearn has kernel ridge regression, for example, and maybe something else. @shubhamsvc
@david-cortes-intel @Vika-F Thank you for the quick response. What ULP accuracy is expected for exp(x) in this case?
For the general vExp function, which is also used in logistic regression, I don't think it'd be advantageous to use something with more than 1 ULP of error even if much faster, as numerical inaccuracies there do have a noticeable effect on convergence speed and quality of results. For the SVM RBF kernel specifically I am not sure - perhaps someone more familiar with the underlying algorithm could comment. Nevertheless, it would still be ideal to know the accuracy level of this 'exp' function, at the very least to leave it as a comment in the code. Perhaps one potential next step could be to conduct tests with RBF SVMs using the sklbench repository (which requires sklearnex built against this oneDAL branch) before and after this PR and see how the quality metrics change. @Alexsandruss Could you provide instructions for running the SVM RBF cases from sklbench? @rakshithgb-fujitsu Could you perhaps try out the changes before/after this PR with SVM RBF kernel examples on ARM hardware?
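One way to answer the accuracy question empirically is a brute-force ULP sweep of the approximation against a higher-precision reference. Below is a minimal sketch of such a check (illustrative only; `exp_under_test` is a hypothetical stand-in for whatever scalar or vector implementation is being evaluated, not anything from this PR):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in: replace with a scalar wrapper around the exp implementation under test.
static float exp_under_test(float x)
{
    return std::exp(x);
}

// exp() results in the swept range are positive and finite, so their IEEE-754 bit
// patterns are monotonically ordered and the ULP distance is just the absolute
// difference of those bit patterns.
static std::uint32_t ulp_distance(float approx, float reference)
{
    std::uint32_t ia, ib;
    std::memcpy(&ia, &approx, sizeof(ia));
    std::memcpy(&ib, &reference, sizeof(ib));
    return ia > ib ? ia - ib : ib - ia;
}

int main()
{
    std::uint32_t worst = 0;
    // RBF kernel arguments to exp are non-positive, so sweep a representative negative range.
    for (double x = -30.0; x <= 0.0; x += 1e-4)
    {
        const float xf        = static_cast<float>(x);
        const float reference = static_cast<float>(std::exp(static_cast<double>(xf))); // double-precision reference
        const std::uint32_t u = ulp_distance(exp_under_test(xf), reference);
        if (u > worst) worst = u;
    }
    std::printf("max observed ULP error: %u\n", worst);
    return 0;
}
```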
To support the importance of RBF: it is already exposed in sklearnex in the onedal module, https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/onedal/primitives/kernel_functions.py#L90 and could easily become a publicly-usable sklearnex function in short order (replicating sklearn functionality: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html)
I added the PR checklist to guide reviewing/development. |
Pull Request Overview
This pull request adds SVE-based implementations for the postGemmPart function targeting the RBF kernel, accelerating vectorized computations on ARM (Graviton3) for both float and double data types.
- Adds conditional compilation and header inclusion for ARM SVE support.
- Implements an inline exponential function (exp_ps_sve) for float32_t using SVE intrinsics.
- Introduces unrolled loop implementations in the postGemmPart functions for both float and double types using SVE intrinsics.
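For readers less familiar with the pattern, here is a minimal sketch of the kind of predicated SVE loop described above. It is illustrative only: the single `coeff` multiplier, the in-place buffer layout, and the `exp_ps_sve` signature are assumptions made for the example, not the PR's actual code.

```cpp
#include <arm_sve.h>
#include <cstddef>

// Assumed vector exp helper, analogous to the exp_ps_sve discussed in this PR.
svfloat32_t exp_ps_sve(svbool_t pg, svfloat32_t x);

// Applies out[i] = exp(coeff * out[i]) over a float buffer, letting the
// svwhilelt predicate mask the final partial vector.
static void rbf_post_sve(float * out, std::size_t n, float coeff)
{
    const svfloat32_t vcoeff = svdup_n_f32(coeff);
    for (std::size_t i = 0; i < n; i += svcntw())
    {
        const svbool_t pg = svwhilelt_b32_u64(i, n); // active lanes for this iteration
        svfloat32_t v     = svld1_f32(pg, out + i);  // load partial GEMM results
        v                 = svmul_f32_z(pg, v, vcoeff);
        v                 = exp_ps_sve(pg, v);       // vectorized exponential
        svst1_f32(pg, out + i, v);                   // inactive tail lanes are left untouched
    }
}
```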
Comments suppressed due to low confidence (1)
cpp/daal/src/algorithms/kernel_function/kernel_function_rbf_helper.h:239
- [nitpick] The variable name 'exp' may conflict with standard library function names; consider renaming it to 'exp_result' or a similar descriptive identifier.
svfloat32_t exp = exp_ps_sve(pg, tmp);
// Algorithm starts here
svfloat32_t t0 = svmul_f32_z(pg, src, log2_e); // y = x * log2(e)
svfloat32_t t1 = svrintm_f32_z(pg, t0); // rount to int (float)
The comment contains a typo: 'rount' should be corrected to 'round'.
svfloat32_t t1 = svrintm_f32_z(pg, t0); // rount to int (float)
svfloat32_t t1 = svrintm_f32_z(pg, t0); // round to int (float)
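For context on where those first two steps lead, below is a minimal range-reduction exp sketch in the same style. It is illustrative only and is not the implementation from this PR (which was later removed): the coefficients are plain Taylor terms rather than minimax ones, and there is no special handling of overflow, underflow, or NaN, so its ULP accuracy would be modest.

```cpp
#include <arm_sve.h>

// e^x = 2^(x*log2(e)): split y = x*log2(e) into n = floor(y) and f = y - n,
// approximate 2^f = e^(f*ln(2)) with a short polynomial, then scale by 2^n.
inline svfloat32_t exp_sketch_f32(svbool_t pg, svfloat32_t x)
{
    const svfloat32_t log2e = svdup_n_f32(1.4426950408889634f);
    const svfloat32_t ln2   = svdup_n_f32(0.6931471805599453f);

    svfloat32_t y = svmul_f32_z(pg, x, log2e); // y = x * log2(e)
    svfloat32_t n = svrintm_f32_z(pg, y);      // n = floor(y)
    svfloat32_t f = svsub_f32_z(pg, y, n);     // f in [0, 1)
    svfloat32_t r = svmul_f32_z(pg, f, ln2);   // 2^f = e^r with r = f * ln(2)

    // degree-4 Taylor polynomial for e^r, evaluated by Horner's rule
    svfloat32_t p = svdup_n_f32(1.0f / 24.0f);
    p = svmla_f32_z(pg, svdup_n_f32(1.0f / 6.0f), p, r); // 1/6 + p*r
    p = svmla_f32_z(pg, svdup_n_f32(0.5f), p, r);        // 1/2 + p*r
    p = svmla_f32_z(pg, svdup_n_f32(1.0f), p, r);        // 1   + p*r
    p = svmla_f32_z(pg, svdup_n_f32(1.0f), p, r);        // e^r

    // multiply by 2^n via FSCALE (n is exactly integral here, so the conversion is exact)
    return svscale_f32_z(pg, p, svcvt_s32_f32_z(pg, n));
}
```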
#if (__CPUID__(DAAL_CPU) == __sve__)

// exp function for float32_t using SVE intrinsics
inline svfloat32_t exp_ps_sve(svbool_t& pg, svfloat32_t& src) {
[nitpick] If the parameters in exp_ps_sve are not intended to be modified within the function, consider declaring them as const (e.g., 'const svbool_t&' and 'const svfloat32_t&') to make the intent explicit.
inline svfloat32_t exp_ps_sve(svbool_t& pg, svfloat32_t& src) {
inline svfloat32_t exp_ps_sve(const svbool_t& pg, const svfloat32_t& src) { |
Yes, Shubham is from Fujitsu as well. We'll share all the benchmark details and validate it on ARM hardware.
@david-cortes-intel The current SVE implementation of exp had low accuracy, so it has been removed in this PR. I am working on improving the ULP accuracy of exp and will include an updated version in a subsequent PR.
CI failures do not look related to this PR, but @shubhamsvc please lint the files according to the instructions and merge the main branch here.
* Added SVE intrinsics for postGemm function
* Removed SVE implementation of float exponential due to accuracy issues
* Format code using clang-format

Co-authored-by: shubham.chaudhari <Shubham.Chaudhari@fujitsu.com>
This pull request adds SVE-based implementations of the postGemmPart function for both float and double types to accelerate vectorized computation on ARM.
Average Performance (on Graviton3)
Float: ~1.26× speedup over scalar
Double: ~1.19× speedup over scalar
A PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are addressed.
This approach ensures that reviewers don't spend extra time asking for standard requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a documentation update doesn't require performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance