Hoist all bounds checks out of loops #31

jub0bs · 2025-05-17T16:35:24Z

Some benchmark results (no changes in terms of allocations):

$ benchstat -filter '.name:Simple' old new
goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                     │     old     │                 new                 │
                     │   sec/op    │   sec/op     vs base                │
Simple/ASCII-8         167.5n ± 3%   142.4n ± 4%  -14.96% (p=0.000 n=20)
Simple/French-8        267.1n ± 2%   220.3n ± 3%  -17.49% (p=0.000 n=20)
Simple/Nordic-8        432.6n ± 2%   346.7n ± 3%  -19.86% (p=0.000 n=20)
Simple/Long_lead-8     326.2n ± 0%   275.9n ± 0%  -15.41% (p=0.000 n=20)
Simple/Long_middle-8   520.4n ± 0%   441.6n ± 0%  -15.15% (p=0.000 n=20)
Simple/Long_trail-8    721.6n ± 0%   608.3n ± 0%  -15.70% (p=0.000 n=20)
Simple/Long_diff-8     8.549µ ± 1%   6.537µ ± 0%  -23.53% (p=0.000 n=20)
Simple/Tibetan-8       562.1n ± 2%   493.3n ± 3%  -12.25% (p=0.000 n=20)
geomean                571.6n        475.2n       -16.86%

jub0bs · 2025-05-18T11:14:36Z

Comparison to dgryski's implementation

Before

$ benchstat -col '/impl@(dgryski agniva)' old
goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                   │   dgryski    │               agniva                │
                   │    sec/op    │   sec/op     vs base                │
All/case=ASCII-8      335.6n ± 0%   163.1n ± 2%  -51.41% (p=0.000 n=20)
All/case=French-8     616.0n ± 0%   267.0n ± 2%  -56.66% (p=0.000 n=20)
All/case=Nordic-8    1184.0n ± 1%   429.3n ± 1%  -63.74% (p=0.000 n=20)
All/case=Tibetan-8    952.8n ± 2%   555.7n ± 1%  -41.67% (p=0.000 n=20)
geomean               694.9n        319.2n       -54.06%

After

$ benchstat -col '/impl@(dgryski agniva)' new
goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                   │   dgryski    │               agniva                │
                   │    sec/op    │   sec/op     vs base                │
All/case=ASCII-8      342.1n ± 1%   131.2n ± 3%  -61.63% (p=0.000 n=20)
All/case=French-8     628.1n ± 0%   221.9n ± 2%  -64.68% (p=0.000 n=20)
All/case=Nordic-8    1185.5n ± 1%   335.0n ± 2%  -71.74% (p=0.000 n=20)
All/case=Tibetan-8    974.5n ± 1%   493.4n ± 1%  -49.37% (p=0.000 n=20)
geomean               705.8n        263.4n       -62.68%

Comparison to arbovm's implementation

Before

$ benchstat -col '/impl@(arbovm agniva)' old
goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                   │    arbovm    │               agniva                │
                   │    sec/op    │   sec/op     vs base                │
All/case=ASCII-8      356.8n ± 1%   163.1n ± 2%  -54.30% (p=0.000 n=20)
All/case=French-8     644.8n ± 0%   267.0n ± 2%  -58.60% (p=0.000 n=20)
All/case=Nordic-8    1208.5n ± 1%   429.3n ± 1%  -64.48% (p=0.000 n=20)
All/case=Tibetan-8    976.7n ± 3%   555.7n ± 1%  -43.10% (p=0.000 n=20)
geomean               721.9n        319.2n       -55.78%

After

$ benchstat -col '/impl@(arbovm agniva)' new
goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                   │    arbovm    │               agniva                │
                   │    sec/op    │   sec/op     vs base                │
All/case=ASCII-8      358.6n ± 1%   131.2n ± 3%  -63.40% (p=0.000 n=20)
All/case=French-8     653.3n ± 1%   221.9n ± 2%  -66.04% (p=0.000 n=20)
All/case=Nordic-8    1219.0n ± 1%   335.0n ± 2%  -72.52% (p=0.000 n=20)
All/case=Tibetan-8    980.0n ± 2%   493.4n ± 1%  -49.65% (p=0.000 n=20)
geomean               727.3n        263.4n       -63.79%

agnivade

This is very clever, thank you! Hoisting the bounds checks from the main loops looks good to me.

I wonder how much performance are we getting in hoisting out the checks in the trimLongestCommonSuffix and trimLongestCommonPrefix functions. Because it looks like we have to pay quite a large cost to readability.

Could you re-run the benchmarks by reverting the logic of the two functions?

levenshtein.go

jub0bs · 2025-05-19T06:17:21Z

I wonder how much performance are we getting in hoisting out the checks in the trimLongestCommonSuffix and trimLongestCommonPrefix functions.

I guess the performance gain largely depends on the two input strings: if they have a long common prefix and/or a long common suffix, avoiding unnecessary bounds checks in those loops is likely to have a non-negligible benefit on performance.

[...] it looks like we have to pay quite a large cost to readability.

I don't think readability suffers that much, esp. since I've abstracted the two loops to functions with clarifying names.

Could you re-run the benchmarks by reverting the logic of the two functions?

Here are some results after reverting the commit in question:

goos: darwin
goarch: amd64
pkg: github.com/agnivade/levenshtein
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                     │     new     │                new2                 │
                     │   sec/op    │   sec/op     vs base                │
Simple/ASCII-8         142.4n ± 4%   143.4n ± 4%        ~ (p=0.344 n=20)
Simple/French-8        220.3n ± 3%   226.4n ± 1%   +2.75% (p=0.000 n=20)
Simple/Nordic-8        346.7n ± 3%   336.0n ± 1%   -3.07% (p=0.007 n=20)
Simple/Long_lead-8     275.9n ± 0%   338.1n ± 0%  +22.56% (p=0.000 n=20)
Simple/Long_middle-8   441.6n ± 0%   501.2n ± 0%  +13.51% (p=0.000 n=20)
Simple/Long_trail-8    608.3n ± 0%   631.7n ± 0%   +3.83% (p=0.000 n=20)
Simple/Long_diff-8     6.537µ ± 0%   5.925µ ± 1%   -9.37% (p=0.000 n=20)
Simple/Tibetan-8       493.3n ± 3%   509.9n ± 1%   +3.35% (p=0.001 n=20)
geomean                475.2n        493.7n        +3.88%

The commit is worth keeping, in my opinion. What do you think?

jub0bs · 2025-05-19T06:18:59Z

I've changed the names of the sub-benchmarks in order for the results to be easier to analyse with benchstat.

agnivade · 2025-05-19T06:55:01Z

I guess the performance gain largely depends on the two input strings: if they have a long common prefix and/or a long common suffix,

Yes. Could you add the additional benchmarks you added for the common prefix and suffixes? I don't see them in the PR.

jub0bs · 2025-05-19T06:56:41Z

Yes. Could you add the additional benchmarks you added for the common prefix and suffixes? I don't see them in the PR.

I haven't added any new cases yet, but I can.

Edit: Actually, aren't the cases named "Long lead" and "Long trail" already testing exactly this? Are new benchmark cases needed?

Note: functions trimLongestCommonPrefix and trimLongestCommonSuffix are implemented in such a way as to make them eligible for inlining.

agnivade · 2025-05-19T08:04:04Z

My bad there. Somehow I forgot that the long lead and long trail cases already exist. Thanks!

jub0bs · 2025-05-19T08:53:22Z

Great! Thanks for taking the time to review and accept my changes.

agnivade · 2025-05-19T13:54:14Z

Thanks for spending the time to improve the library! Hope it's useful to you.

jub0bs changed the title ~~Hoist all but one bounds check out of loops~~ Optimize performance May 17, 2025

jub0bs force-pushed the master branch from 039af6b to 50adb1e Compare May 18, 2025 10:57

jub0bs changed the title ~~Optimize performance~~ Hoist all bounds checks out of loops May 18, 2025

jub0bs force-pushed the master branch from 50adb1e to 2301ca7 Compare May 18, 2025 11:00

jub0bs force-pushed the master branch from 2301ca7 to 6f71534 Compare May 18, 2025 11:24

agnivade reviewed May 19, 2025

View reviewed changes

levenshtein.go Outdated Show resolved Hide resolved

jub0bs force-pushed the master branch from 6f71534 to 9e0f54f Compare May 19, 2025 06:39

jub0bs added 4 commits May 19, 2025 09:17

Improve names of sub-benchmarks for benchstat

a3018a7

Hoist bounds checks out of trimming loops

bedded4

Note: functions trimLongestCommonPrefix and trimLongestCommonSuffix are implemented in such a way as to make them eligible for inlining.

Hoist remaining bounds checks out of loops

964c3f0

Factorize increment

5480654

jub0bs force-pushed the master branch from 9e0f54f to 5480654 Compare May 19, 2025 07:18

agnivade approved these changes May 19, 2025

View reviewed changes

agnivade merged commit 4208675 into agnivade:master May 19, 2025
6 checks passed

jub0bs mentioned this pull request May 20, 2025

Update the README with new benchmark results #32

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hoist all bounds checks out of loops #31

Hoist all bounds checks out of loops #31

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hoist all bounds checks out of loops #31

Hoist all bounds checks out of loops #31

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Comparison to dgryski's implementation

Before

After

Comparison to arbovm's implementation

Before

After

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!