8000 OpenSwathDecoyGenerator remove duplicates by jcharkow · Pull Request #6054 · OpenMS/OpenMS · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

OpenSwathDecoyGenerator remove duplicates #6054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
May 6, 2022

Conversation

jcharkow
Copy link
Collaborator

Description

Addressing issue #5653

This commit adds a check to OpenSwathDecoyGenerator to ensure that a decoy sequence is not generated if it is the same as a target sequence or another decoy sequence.

This check is done with modifications meaning that the same peptide sequence with different modifications can still have unique decoys generated

Tests are added and tests are changed to reflect the new behaviour.

Checklist:

  • [ X ] Make sure that you are listed in the AUTHORS file
  • [ X ] Add relevant changes and new features to the CHANGELOG file
  • [ X ] I have commented my code, particularly in hard-to-understand areas
  • [ X ] New and existing unit tests pass locally with my changes
  • [ X ] Updated or added python bindings for changed or new classes. (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI:

If your PR is failing you can check out

Note:

  • Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

Advanced commands (admins / reviewer only):

  • /reformat (experimental) applies the clang-format style changes as additional commit
  • setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
  • commenting with rebuild jenkins will retrigger Jenkins-based CI builds

jcharkow added 11 commits April 4, 2022 15:41
Previously if peptide did not end in K or R switchKR would replace the
last AA with a random AA. Here, the behaviour is changed such that if K
or R is not present, we keep the terminal AA. This new behaviour means
that the same nontryptic peptide sequence will always generate the same
decoy peptide.
Not sure how to add to documentation without changing the format of the
method. So had to change return type from void to peptide
Add a check to check that decoy sequences generated do not happen to be
target sequences (this will cause an error in OpenSwathWorkflow). Also
added appropriate tests.

Current method works, but is quite slow with larger libraries.
instead of searching using vector, search using unordered map for decoy
sequences in targets which drastically speeds up runtime
this allows file to be converted to .tsv without error
This reverts commit 79a090f.
Furthermore, also change switch KR back to a void function (revert
5321022 partially)
This reverts commit e7b5f5d.
Change test back to the previous one because changed switchKR behaviour
back to default.
replace tests output to reflect the new changes (to swtichKR for
openswathdecoyGenerator) and for dropping decoy duplicates (MRMDecoy_test)
with switchKR feature can randomly get 2 peptide sequences that are not
tryptic and almost identical generate the same decoy
e.g. PEPTIDA --> DITPEPQ and PEPTIDE --> DITPEPQ (Q is randomly chosen
for the final AA twice)

- Add check to ensure that this decoy sequence is not added twice
- modify openswathdecoy_test5 to test for this function
	- since hard to test the random nature of switchKR, add 2
duplicate target peptides and ensure the decoy is only generated once
	- Also add the same target peptide with a modification, and
ensure that this peptide does get a decoy generated for it (because it
is a different sequence)
@jcharkow jcharkow requested a review from hroest April 20, 2022 21:41
@jcharkow jcharkow changed the title Addressing Issue https://github.com/OpenMS/OpenMS/issues/5653 OpenSwathDecoyGenerator remove duplicates Apr 20, 2022
full_peptide_name += pep.sequence[loc];
}
// C-terminal and N-terminal modifications may be at positions -1 or pep.sequence
for (Size modloc = 0; modloc < pep.mods.size(); modloc++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this inner loop is potentially very inefficient. How often is this function called?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called once per peptide sequence so it might be called a lot. However the inner loop should only be called maybe maximum of 3 times because theoretically there are not too many modifications on a peptide. This code block is actually taken from OpenMS tsv writer.

String full_peptide_name = "";
for (int loc = -1; loc <= (int)pep.sequence.size(); loc++)
{
if (loc > -1 && loc < (int)pep.sequence.size())
{
full_peptide_name += pep.sequence[loc];
}
// C-terminal and N-terminal modifications may be at positions -1 or pep.sequence
for (Size modloc = 0; modloc < pep.modifications.size(); modloc++)
{
if (pep.modifications[modloc].location == loc)
{
full_peptide_name += "(UniMod:" + String(pep.modifications[modloc].unimod_id) + ")";
}
}
}

however this is probably not the most efficient way. Basically here I am just looking for a data structure which can hold modified peptide sequences to check for equality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running with a big library with over 2000000 it generates decoys in a reasonable time for me (e.g. 20 sec) so I think this implementation is good enough for now

@timosachsenberg
Copy link
Contributor

/reformat

Copy link
Contributor
@timosachsenberg timosachsenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • some minor formatting / typos
  • potential inefficent generation of modified strings

jcharkow and others added 5 commits April 26, 2022 15:57
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>
@timosachsenberg
Copy link
Contributor

Btw you can add suggestions to a batch so that only one commit is created. Each commit starts a CI run of which we only have a limited amount per month.

@jcharkow
Copy link
Collaborator Author
jcharkow commented Apr 26, 2022

Btw you can add suggestions to a batch so that only one commit is created. Each commit starts a CI run of which we only have a limited amount per month.

Sorry and thanks for letting me know. I will do this next time.

{
if (pep.mods[modloc].location == loc)
{
full_peptide_name += "(UniMod:" + String(pep.mods[modloc].unimod_id) + ")";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are still some style errors. e.g. here no indentation. below it uses probably tabs instead of two spaces. maybe give it another quick check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give try to give a quick fix to move this forward

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't push to Roestlab... I think you need to do the changes yourself. sorry

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem I'll take a look

@timosachsenberg timosachsenberg merged commit 0fa86cb into OpenMS:develop May 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0