perf: speed up line detection #847

karfau · 2025-02-25T23:10:59Z

Enforce timeouts on tests, specifically for the one not using the default `normalizeLineEndings`, so it actually fails.
Improve the reused regular expression so it no longer contains `.*` twice. This speeds up processing of larger files containing certain Unicode characters, as demonstrated in "Increased processing times when using \u2028 and \u2029" (#838).
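
The Unicode characters matter because of how `.` behaves in JavaScript regular expressions: without the `s` flag it does not match `\n`, `\r`, `\u2028`, or `\u2029`. A minimal sketch of that behaviour follows; the helper name `dotRun` is made up for illustration and is not part of xmldom.

```javascript
// Hypothetical helper (not xmldom code): returns the text that `.*`
// consumes when it starts matching at the beginning of the input.
function dotRun(input) {
	return /^.*/.exec(input)[0];
}

console.log(dotRun('ab\u2028cd').length); // 2: `.` stops at U+2028
console.log(dotRun('ab\u2029cd').length); // 2: `.` stops at U+2029
console.log(dotRun('ab\u0085cd').length); // 5: U+0085 is matched by `.`
```

A pattern that pairs `.*` with `\r\n?|\n` can therefore consume a long run of text up to a `\u2028`, fail to find a terminator it recognises, and retry from the next position, which is presumably where the extra time in #838 came from.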

test/dom-parser.test.js

when the `locator` is not disabled

codecov · 2025-02-25T23:19:39Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.12%. Comparing base (bfb5c44) to head (23e2aaf).
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #847   +/-   ##
=======================================
  Coverage   95.12%   95.12%           
=======================================
  Files           8        8           
  Lines        2196     2196           
  Branches      577      577           
=======================================
  Hits         2089     2089           
  Misses        107      107


karfau · 2025-02-25T23:24:19Z

@kboshold this one is an actual fix for the performance issue from your initial example, even when not normalizing the input.
Also the timeout of the test is now enforced.

Ponynjaa

Just two minor things I would change: the test-descriptions on the test you copied and the new one don't describe what's actually being tested.

Other than that it looks really good! I was a little confused at first when I saw your changes to the regex and loop, but after playing around with it for a bit and testing the outcomes, it seems to work exactly as before, just without needlessly matching the content in between line endings. Good job dude!
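
The kind of check described above can be sketched as follows. Both patterns are assumptions for illustration (the PR text only says the old expression contained `.*` twice), and `lineEnds` is a hypothetical helper rather than xmldom code.

```javascript
// Assumed shape of a pattern that contains `.*` twice; not copied from xmldom.
const WITH_CONTENT = /.*(?:\r\n?|\n)|.*$/g;
// Assumed replacement that matches only the terminator (or the end of input).
const TERMINATOR_ONLY = /\r\n?|\n|$/g;

// Hypothetical helper: collects the offset just past every match.
function lineEnds(pattern, source) {
	pattern.lastIndex = 0; // global regexes keep state between calls
	const ends = [];
	let match;
	while ((match = pattern.exec(source)) !== null) {
		const end = match.index + match[0].length;
		ends.push(end);
		if (end >= source.length) break; // stop at the end instead of looping on an empty match
	}
	return ends;
}

const sample = 'a\r\nbc\n\rd\u2028e';
console.log(lineEnds(WITH_CONTENT, sample)); // [ 3, 6, 7, 10 ]
console.log(lineEnds(TERMINATOR_ONLY, sample)); // [ 3, 6, 7, 10 ]
```

Both variants report the same line boundaries for this sample; the second one simply never has to consume the text between terminators.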

Ponynjaa · 2025-02-26T21:37:44Z

test/dom-parser.test.js

@@ -330,7 +331,20 @@ describe('DOMParser', () => {
 			const source = `<root>${'A'.repeat(50000)}\u2029${'A'.repeat(50000)}\u0085${'A'.repeat(50000)}\u2028${'A'.repeat(50000)}\u2029</root>`;
 			const doc = parser.parseFromString(source, MIME_TYPE.XML_TEXT);
 			expect(new XMLSerializer().serializeToString(doc)).toEqual(source.replace(/[\u0085\u2028\u2029]/g, '\n'));
-		}, 500);
+		});
+		test('should be able to open documents with alternative whitespace without creating a bottleneck and replacing them with \\n', () => {


and replacing them with \n

this description doesn't make sense for this test

Ponynjaa · 2025-02-26T21:39:23Z

test/dom-parser.test.js

@@ -330,7 +331,20 @@ describe('DOMParser', () => {
 			const source = `<root>${'A'.repeat(50000)}\u2029${'A'.repeat(50000)}\u0085${'A'.repeat(50000)}\u2028${'A'.repeat(50000)}\u2029</root>`;


without creating a bottleneck

The test description here doesn't make sense either; maybe you can rephrase it while you are at it?

Yes, somehow the test I copied is also covering the one that @kboshold added, so I merged them into one and corrected the description.

kboshold · 2025-02-27T06:44:30Z

test/dom-parser.test.js

+			const normalizeLineEndings = jest.fn((source) => source);
+			const { parser } = getTestParser({ onError, normalizeLineEndings });
+			const source = `<root>${'A'.repeat(15000)}\u2029${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;


I think it would make more sense to have a \n as the first character, since the test currently has \u2029 twice.

Solution:

const source = `<root>${'A'.repeat(15000)}\n${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;

I don't think it's that big of a difference, but hey, I applied your suggestion.

karfau · 2025-02-27T21:27:16Z

I applied all the suggested changes, so please approve it.
I would say the PR is already an improvement that we should get landed and released.
Or just make the changes yourself, I think both of you are able to push to this branch.

test: enforce lower test timeouts

473441b

github-advanced-security bot found potential problems Feb 25, 2025


perf: speed up line detection for some cases

7c73493

when the `locator` is not disabled

karfau linked an issue Feb 25, 2025 that may be closed by this pull request

Increased processing times when using \u2028 and \u2029 #838

Closed

karfau changed the title ~~perf: improve line detection regular expression for locator~~ perf: speed up line detection Feb 25, 2025

karfau requested a review from shunkica February 25, 2025 23:22

karfau marked this pull request as ready for review February 25, 2025 23:23

karfau added this to the 0.9.8 milestone Feb 25, 2025

karfau mentioned this pull request Feb 25, 2025

fix: fixed duration check in alternative whitespace test #846

Closed

karfau enabled auto-merge (squash) February 25, 2025 23:29

karfau disabled auto-merge February 25, 2025 23:29

karfau enabled auto-merge (rebase) February 25, 2025 23:29

karfau requested review from Ponynjaa and removed request for shunkica February 25, 2025 23:29

Ponynjaa requested changes Feb 26, 2025


karfau added 2 commits February 26, 2025 23:14

test: drop redundant test and correct test description

4a3d1f2

Merge remote-tracking branch 'origin/master' into improve-line-split

4cadc7d

karfau disabled auto-merge February 26, 2025 22:17

karfau enabled auto-merge (squash) February 26, 2025 22:18

kboshold suggested changes Feb 27, 2025


karfau disabled auto-merge February 27, 2025 08:36

karfau added 2 commits February 27, 2025 22:23

style: switch whitespace to avoid duplicate chars

0920ea9

Merge remote-tracking branch 'origin/master' into improve-line-split

23e2aaf

karfau requested review from Ponynjaa and kboshold February 27, 2025 21:28

kboshold approved these changes Feb 27, 2025


Ponynjaa approved these changes Feb 27, 2025


Ponynjaa merged commit d4dc4da into master Feb 27, 2025
38 checks passed

Ponynjaa deleted the improve-line-split branch February 27, 2025 21:45
