8000 perf: speed up line detection by karfau · Pull Request #847 · xmldom/xmldom · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

perf: speed up line detection #847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 27, 2025
Merged

perf: speed up line detection #847

merged 6 commits into from
Feb 27, 2025

Conversation

karfau
Copy link
Member
@karfau karfau commented Feb 25, 2025
  • enforce timeouts on tests and specifically for the one not using the default normalizeLineEndings, so it actually fails.
  • improve reused regular expression to not contain .* twice, to speed up processing larger files containing certain Unicode chars as demonstrated in Increased processing times when using \u2028 and \u2029 #838

when the `locator` is not disabled
Copy link
codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.12%. Comparing base (bfb5c44) to head (23e2aaf).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #847   +/-   ##
=======================================
  Coverage   95.12%   95.12%           
=======================================
  Files           8        8           
  Lines        2196     2196           
  Branches      577      577           
=======================================
  Hits         2089     2089           
  Misses        107      107           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@karfau karfau linked an issue Feb 25, 2025 that may be closed by this pull request
@karfau karfau changed the title perf: improve line detection regular expression for locator perf: speed up line detection Feb 25, 2025
@karfau karfau requested a review from shunkica February 25, 2025 23:22
@karfau karfau marked this pull request as ready for review February 25, 2025 23:23
@karfau
Copy link
Member Author
karfau commented Feb 25, 2025

@kboshold this one is an actual fix for the performance issue from your initial example, even when not normalizing the input.
Also the timeout of the test is now enforced.

@karfau karfau added this to the 0.9.8 milestone Feb 25, 2025
@karfau karfau enabled auto-merge (squash) February 25, 2025 23:29
@karfau karfau disabled auto-merge February 25, 2025 23:29
@karfau karfau enabled auto-merge (rebase) February 25, 2025 23:29
@karfau karfau requested review from Ponynjaa and removed request for shunkica February 25, 2025 23:29
Copy link
Collaborator
@Ponynjaa Ponynjaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor things I would change: the test-descriptions on the test you copied and the new one don't describe what's actually being tested.

Other than that it looks really good! I was a little confused at first when I saw your changes to the regex and loop, but after playing around with it for a little bit and testing the outcomes it seems to work exactly as before but without matching the content inbetween line-endings for no reason. Good job dude!

@@ -330,7 +331,20 @@ describe('DOMParser', () => {
const source = `<root>${'A'.repeat(50000)}\u2029${'A'.repeat(50000)}\u0085${'A'.repeat(50000)}\u2028${'A'.repeat(50000)}\u2029</root>`;
const doc = parser.parseFromString(source, MIME_TYPE.XML_TEXT);
expect(new XMLSerializer().serializeToString(doc)).toEqual(source.replace(/[\u0085\u2028\u2029]/g, '\n'));
}, 500);
});
test('should be able to open documents with alternative whitespace without creating a bottleneck and replacing them with \\n', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and replacing them with \n

this description doesn't make sense for this test

@@ -330,7 +331,20 @@ describe('DOMParser', () => {
const source = `<root>${'A'.repeat(50000)}\u2029${'A'.repeat(50000)}\u0085${'A'.repeat(50000)}\u2028${'A'.repeat(50000)}\u2029</root>`;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without creating a bottleneck

The test description here actually also doesn't make sense, maybe you can rephrase it when you are already at it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, somehow the test I copied is also covering the one that @kboshold added, so I merged them into one and corrected the description.

@karfau karfau disabled auto-merge February 26, 2025 22:17
@karfau karfau enabled auto-merge (squash) February 26, 2025 22:18
const >
const normalizeLineEndings = jest.fn((source) => source);
const { parser } = getTestParser({ onError, normalizeLineEndings });
const source = `<root>${'A'.repeat(15000)}\u2029${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense to have a \n as the first character, since the test currently has \u2029 twice.

Solution:

const source = `<root>${'A'.repeat(15000)}\n${'A'.repeat(15000)}\u0085${'A'.repeat(15000)}\u2028${'A'.repeat(15000)}\u2029</root>`;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's such big of a difference, but hey, I applied your suggestion.

@karfau karfau disabled auto-merge February 27, 2025 08:36
@karfau
Copy link
Member Author
karfau commented Feb 27, 2025

I applied all the suggested changes, so please approve it.
I would say the PR is already an improvement that we should get landed and released.
Or just make the changes yourself, I think both of you are able to push to this branch.

@karfau karfau requested review from Ponynjaa and kboshold February 27, 2025 21:28
@Ponynjaa Ponynjaa merged commit d4dc4da into master Feb 27, 2025< 816A /relative-time>
38 checks passed
@Ponynjaa Ponynjaa deleted the improve-line-split branch February 27, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Increased processing times when using \u2028 and \u2029
3 participants
0