-
-
Notifications
You must be signed in to change notification settings - Fork 356
OpenSwathDecoyGenerator remove duplicates #6054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously if peptide did not end in K or R switchKR would replace the last AA with a random AA. Here, the behaviour is changed such that if K or R is not present, we keep the terminal AA. This new behaviour means that the same nontryptic peptide sequence will always generate the same decoy peptide.
Not sure how to add to documentation without changing the format of the method. So had to change return type from void to peptide
Add a check to check that decoy sequences generated do not happen to be target sequences (this will cause an error in OpenSwathWorkflow). Also added appropriate tests. Current method works, but is quite slow with larger libraries.
instead of searching using vector, search using unordered map for decoy sequences in targets which drastically speeds up runtime
this allows file to be converted to .tsv without error
This reverts commit e7b5f5d. Change test back to the previous one because changed switchKR behaviour back to default.
with switchKR feature can randomly get 2 peptide sequences that are not tryptic and almost identical generate the same decoy e.g. PEPTIDA --> DITPEPQ and PEPTIDE --> DITPEPQ (Q is randomly chosen for the final AA twice) - Add check to ensure that this decoy sequence is not added twice - modify openswathdecoy_test5 to test for this function - since hard to test the random nature of switchKR, add 2 duplicate target peptides and ensure the decoy is only generated once - Also add the same target peptide with a modification, and ensure that this peptide does get a decoy generated for it (because it is a different sequence)
full_peptide_name += pep.sequence[loc]; | ||
} | ||
// C-terminal and N-terminal modifications may be at positions -1 or pep.sequence | ||
for (Size modloc = 0; modloc < pep.mods.size(); modloc++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this inner loop is potentially very inefficient. How often is this function called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called once per peptide sequence so it might be called a lot. However the inner loop should only be called maybe maximum of 3 times because theoretically there are not too many modifications on a peptide. This code block is actually taken from OpenMS tsv writer.
OpenMS/src/openms/source/ANALYSIS/OPENSWATH/OpenSwathTSVWriter.cpp
Lines 145 to 160 in c2555c3
String full_peptide_name = ""; | |
for (int loc = -1; loc <= (int)pep.sequence.size(); loc++) | |
{ | |
if (loc > -1 && loc < (int)pep.sequence.size()) | |
{ | |
full_peptide_name += pep.sequence[loc]; | |
} | |
// C-terminal and N-terminal modifications may be at positions -1 or pep.sequence | |
for (Size modloc = 0; modloc < pep.modifications.size(); modloc++) | |
{ | |
if (pep.modifications[modloc].location == loc) | |
{ | |
full_peptide_name += "(UniMod:" + String(pep.modifications[modloc].unimod_id) + ")"; | |
} | |
} | |
} |
however this is probably not the most efficient way. Basically here I am just looking for a data structure which can hold modified peptide sequences to check for equality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running with a big library with over 2000000 it generates decoys in a reasonable time for me (e.g. 20 sec) so I think this implementation is good enough for now
/reformat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- some minor formatting / typos
- potential inefficent generation of modified strings
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Btw you can add suggestions to a batch so that only one commit is created. Each commit starts a CI run of which we only have a limited amount per month. |
Sorry and thanks for letting me know. I will do this next time. |
{ | ||
if (pep.mods[modloc].location == loc) | ||
{ | ||
full_peptide_name += "(UniMod:" + String(pep.mods[modloc].unimod_id) + ")"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are still some style errors. e.g. here no indentation. below it uses probably tabs instead of two spaces. maybe give it another quick check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give try to give a quick fix to move this forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't push to Roestlab... I think you need to do the changes yourself. sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem I'll take a look
manually go through uncrustify changes and change spacing and tabbing accordingly.
Description
Addressing issue #5653
This commit adds a check to OpenSwathDecoyGenerator to ensure that a decoy sequence is not generated if it is the same as a target sequence or another decoy sequence.
This check is done with modifications meaning that the same peptide sequence with different modifications can still have unique decoys generated
Tests are added and tests are changed to reflect the new behaviour.
Checklist:
How can I get additional information on failed tests during CI:
If your PR is failing you can check out
Note:
Advanced commands (admins / reviewer only):
/reformat
(experimental) applies the clang-format style changes as additional commitrebuild jenkins
will retrigger Jenkins-based CI builds