perf: Optimize format_word function performance by using direct unicode mapping #245

quake · 2025-03-22T12:44:16Z

This PR optimizes the format_word function to improve performance when converting full-width characters to half-width.

Changes

Replaced character-by-character string operations with direct Unicode code point mapping
Eliminated unnecessary string allocations by using a more efficient character transformation approach
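
The approach described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name to_halfwidth_sketch is hypothetical, and it uses the safe char::from_u32 rather than the unchecked conversion shown in the diff. It relies on fullwidth digits and letters sitting exactly 0xFEE0 above their ASCII counterparts.

```rust
/// Hypothetical helper (an assumption for illustration, not the PR's code):
/// maps fullwidth digits and letters to halfwidth in a single pass.
fn to_halfwidth_sketch(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // Fullwidth 0-9, A-Z and a-z are exactly 0xFEE0 above ASCII.
            '\u{FF10}'..='\u{FF19}' | '\u{FF21}'..='\u{FF3A}' | '\u{FF41}'..='\u{FF5A}' => {
                // Safe conversion; keeps the original char if the result is invalid.
                char::from_u32(c as u32 - 0xFEE0).unwrap_or(c)
            }
            // The ideographic space becomes an ASCII space.
            '\u{3000}' => ' ',
            // Everything else passes through unchanged.
            _ => c,
        })
        // Collect into one String instead of building intermediate strings.
        .collect()
}
```

Because each char is mapped directly and collected once, the only allocation in this sketch is the output String.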

This change should result in lower CPU usage and memory allocation when processing text with many full-width characters. For example, running the format_json_2k benchmark on my local PC gives:

format_json_2k          time:   [25.914 ms 25.944 ms 25.976 ms]
                        change: [-11.110% -10.959% -10.807%] (p = 0.00 < 0.05)
                        Performance has improved.

huacnlee · 2025-03-23T02:12:18Z

autocorrect/src/rule/halfwidth.rs

+                // checked char is in range of fullwidth number and alphabetic
+                unsafe { char::from_u32_unchecked(c as u32 - 0xFEE0) }
+            }
+            '\u{3000}' => ' ',


How did you know that the invisible space I had there before was \u{3000}? 😂

https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)

U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space".
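
A small sketch of why the space needs its own match arm, assuming the 0xFEE0 offset from the diff above. The constant and function names are hypothetical. Subtracting the offset from U+3000 underflows, so the offset rule can never produce an ASCII space from it; only U+FF00 would, and the quoted article notes that U+FF00 does not fill that role.

```rust
// Hypothetical names, for illustration only.
const FULLWIDTH_OFFSET: u32 = 0xFEE0;

/// Returns true if subtracting the fullwidth offset from `c` yields U+0020.
fn offset_maps_to_space(c: char) -> bool {
    // checked_sub returns None when the code point is below the offset,
    // which is the case for U+3000 (0x3000 < 0xFEE0).
    (c as u32).checked_sub(FULLWIDTH_OFFSET) == Some(' ' as u32)
}
```

This is why the diff maps '\u{3000}' to ' ' explicitly instead of folding it into the range arm.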

huacnlee · 2025-03-23T02:33:08Z

autocorrect/src/rule/halfwidth.rs

+    let out = text
+        .chars()
+        .map(|c| match c {
+            '\u{FF10}'..='\u{FF19}' | '\u{FF21}'..='\u{FF3A}' | '\u{FF41}'..='\u{FF5A}' => {


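Is there a reference link for this range? I'd like to check what it covers.

The reference is here: https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)

It looks like this could be changed to cover the complete Fullwidth table. There were some characters I missed before, for example: ＠ -> @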

https://www.unicode.org/charts/PDF/UFF00.pdf

I'll follow up and extend this on top of your change.
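
One possible shape for that follow-up, as a sketch only: it assumes the whole fullwidth ASCII range U+FF01..=U+FF5E is mapped with the same 0xFEE0 offset, and the function name is hypothetical. Note that this range also contains fullwidth punctuation such as ， and ！, so the actual rule may need to exclude some code points; that decision is not shown in this thread.

```rust
/// Hypothetical helper: maps any fullwidth ASCII variant to its halfwidth form.
/// Assumes every code point in U+FF01..=U+FF5E should be converted, which
/// may be broader than what the real rule wants (e.g. CJK punctuation).
fn fullwidth_ascii_to_halfwidth(c: char) -> char {
    match c {
        // U+FF01..=U+FF5E mirrors ASCII U+0021..=U+007E at a fixed offset.
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        // Anything outside the range is returned unchanged.
        _ => c,
    }
}
```

With this mapping, ＠ (U+FF20) becomes @ (U+0040), the case mentioned above.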

perf: Optimize format_word function performance by using direct unico…

d9802c8

…de mapping

huacnlee reviewed Mar 23, 2025

huacnlee added 2 commits March 23, 2025 10:51

Add ref link

1e88f4b

.

b179119

huacnlee merged commit d28cb09 into huacnlee:main Mar 23, 2025
6 checks passed
