8000 Use xxh64 to hash the s-mer in syncmers by marcelm · Pull Request #313 · ksahlin/strobealign · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Use xxh64 to hash the s-mer in syncmers #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 20, 2023
Merged

Use xxh64 to hash the s-mer in syncmers #313

merged 1 commit into from
Jun 20, 2023

Conversation

marcelm
Copy link
Collaborator
@marcelm marcelm commented Jun 20, 2023

Single-end accuracy

dataset main (37f30e5) this PR (6c67bfc)
drosophila-50 82.018 82.5118 (+0.4938)
drosophila-100 89.1866 89.2585 (+0.0719)
drosophila-300 93.2196 93.2169 (-0.0027)
CHM13-50 81.1589 81.691 (+0.5321)
CHM13-100 90.5471 90.6392 (+0.0921)
CHM13-150 92.377 92.399 (+0.0220)
CHM13-200 93.2054 93.2258 (+0.0204)
CHM13-300 94.1524 94.1503 (-0.0021)
maize-50 47.1753 47.4174 (+0.2421)
maize-100 70.4836 70.5 (+0.0164)
maize-200 86.6856 86.701 (+0.0154)
rye-50 44.4281 44.6869 (+0.2588)
rye-100 69.2452 69.3358 (+0.0906)

@ksahlin
Copy link
Owner
ksahlin commented Jun 20, 2023

Awesome, approved!

@marcelm marcelm merged commit e32475e into main Jun 20, 2023
@marcelm marcelm deleted the syncmerhash branch June 20, 2023 14:28
@marcelm
Copy link
Collaborator Author
marcelm commented Jun 20, 2023

While measuring memory usage for #314, I also got numbers for how much memory increases due to this PR (XXH64). Here are the numbers for completeness.

Change in memory usage (MB)

dataset before (e0764b6) after (e32475e) difference
drosophila-50 709.043 737.812 +29
drosophila-100 736.922 770.031 +33
drosophila-200 779.059 817.492 +38
drosophila-300 841.98 863.266 +21
maize-50 9661.88 10008.3 +346
maize-100 9670.18 10017.2 +347
maize-200 9677.71 10025.3 +348
maize-300 9680.55 10025.9 +345
CHM13-50 13977.4 14694.7 +717
CHM13-100 13968.8 14685.5 +717
CHM13-200 14005.1 14717.6 +712
CHM13-300 14012.8 14789 +776
rye-50 32938.5 34110.6 +1172
rye-100 32981.6 34155.2 +1174

It’s not that much, but I feel a little bit bad because we’re again eating up some of the memory savings from #278. It’ll be a bit weird in the changelog: "We reduced memory usage from 23 GiB to 13 GiB, but then made other changes and it’s now back at 14.7 GiB" ...

@ksahlin
Copy link
Owner
ksahlin commented Jun 20, 2023

I don't feel bad about it :) PR #278 allows us to

(i) change to 64-bit pointers (which robin hood couldn't do) which explains part of the re-increase.
(ii) Allows us to have an 'unbiased' syncmer sampling protocol near expected density that also improves mapping rate and accuracy.

@marcelm
Copy link
Collaborator Author
marcelm commented Jun 20, 2023

I only wrote "a little bit bad" ... 😄 No I think it’s fine.

Answering here regarding b tipping over: No, it’s 28 in both cases. It appears to be just the size of the randstrobes vector: For CHM13-100, it contains 576889641 randstrobes before and 623250666 after. With 16 bytes/entry, that’s 740 MB more, so that should explain it.

@ksahlin
Copy link
Owner
ksahlin commented Jun 20, 2023

Ah makes sense!

If we are out for saving memory we could always experiment with density (e.g., k-s = 6 instead of 4) but I think it's fine as is. The mapping rate would go down for the shortest reads with more thinning.

@marcelm marcelm mentioned this pull request Jun 21, 2023
12 tasks
rebjannik pushed a commit to rebjannik/strobealign that referenced this pull request May 17, 2025
Use xxh64 to hash the s-mer in syncmers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0