RESET algorithm segmentation fault and reproducibility · Issue #379 · percolator/percolator

Open
kevinkovalchik opened this issue Jun 17, 2024 · 5 comments

Hello,

I did a little testing of v3.07 and I am running into occasional segmentation faults when enabling the RESET algorithm. They seem to be associated with PIN files with low target/decoy ratios. Unfortunately I cannot share the PIN files.
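For anyone who wants to check whether their own PIN files fall into the same low target/decoy regime, here is a minimal sketch that counts the labels. It assumes the usual tab-delimited PIN layout with a header column named `Label` holding `1` for targets and `-1` for decoys; the function names are hypothetical and not part of Percolator.

```python
import csv
from collections import Counter


def count_labels(pin_lines):
    """Count target and decoy rows in tab-delimited PIN text.

    Assumption: the header has a column named "Label" with 1 for
    targets and -1 for decoys. Rows with any other value in that
    column (for example a default-direction row) are skipped.
    """
    reader = csv.reader(pin_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    label_idx = [name.strip().lower() for name in header].index("label")
    counts = Counter()
    for row in reader:
        if len(row) <= label_idx:
            continue  # blank or truncated line
        value = row[label_idx].strip()
        if value == "1":
            counts["target"] += 1
        elif value == "-1":
            counts["decoy"] += 1
    return counts


def target_decoy_ratio(counts):
    """Targets divided by decoys; infinity when there are no decoys."""
    if counts["decoy"] == 0:
        return float("inf")
    return counts["target"] / counts["decoy"]


if __name__ == "__main__":
    # Tiny made-up PIN excerpt, only to show the call pattern.
    sample = [
        "SpecId\tLabel\tScanNr\tXcorr\tPeptide\tProteins\n",
        "t1\t1\t10\t2.5\tK.AAA.R\tprot1\n",
        "d1\t-1\t11\t1.1\tK.BBB.R\trev_prot2\n",
    ]
    counts = count_labels(sample)
    print(counts, target_decoy_ratio(counts))
```

To use it on a real file, pass an open file handle, e.g. `count_labels(open(path, newline=""))`. Note that this ratio only describes the input labels; it says nothing about how many targets are true positives.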

The problem is also not consistently reproducible: I can process the same file twice with the same parameters, and it might work one time and crash the next. The following are the console outputs from such a case.

Segmentation fault:

Percolator version 3.06.0, Build Date May 31 2024 11:32:51
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the
Department of Genome Sciences at the University of Washington.
Issued command:
./percolator-3.07 --testFDR 0.01 --trainFDR 0.01 --results-psms output/percolator_results/test_file.RESET.FDR=0.01.psms.tsv --post-processing-tdc --reset-algorithm --only-psms output/comet_results/test_file.pin
Started Mon Jun 17 11:20:21 2024
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Finding protein decoy prefix for output/comet_results/test_file.pin
Using protein decoy prefix "rev_"
Reading tab-delimited input from datafile output/comet_results/test_file.pin
Features:
lnrSp deltLCn deltCn lnExpect Xcorr Sp IonFrac Mass PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumSP dM absdM 
Found 36814 PSMs
Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
Train/test set contains 19051 positives and 17763 negatives, size ratio=1.07251 and pi0=1
Running the Percolator-RESET algorithm.
Selecting best separating single variable.
No weights file given, using default direction
Selected feature 4 as initial direction. Could separate 230 training set positives with q<0.01 in that direction.
Starting reset: psmsOnly
Selected the best scoring PSM for each of the 32273 peptides from a dataset of 36814 PSMs.
Splitting into train/test
Setting up SVM training for a size of 32273 peptides.
Training set prepared. Starting SVM Training.
02-run_percolator.sh: line 4: 2455196 Segmentation fault      (core dumped)

No segmentation fault:

Percolator version 3.06.0, Build Date May 31 2024 11:32:51
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the
Department of Genome Sciences at the University of Washington.
Issued command:
./percolator-3.07 --testFDR 0.01 --trainFDR 0.01 --results-psms output/percolator_results/test_file.RESET.FDR=0.01.psms.tsv --post-processing-tdc --reset-algorithm --only-psms output/comet_results/test_file.pin
Started Mon Jun 17 11:16:35 2024
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Finding protein decoy prefix for output/comet_results/test_file.pin
Using protein decoy prefix "rev_"
Reading tab-delimited input from datafile output/comet_results/test_file.pin
Features:
lnrSp deltLCn deltCn lnExpect Xcorr Sp IonFrac Mass PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumSP dM absdM 
Found 36814 PSMs
Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
Train/test set contains 19051 positives and 17763 negatives, size ratio=1.07251 and pi0=1
Running the Percolator-RESET algorithm.
Selecting best separating single variable.
No weights file given, using default direction
Selected feature 4 as initial direction. Could separate 230 training set positives with q<0.01 in that direction.
Starting reset: psmsOnly
Selected the best scoring PSM for each of the 32273 peptides from a dataset of 36814 PSMs.
Splitting into train/test
Setting up SVM training for a size of 32273 peptides.
Training set prepared. Starting SVM Training.
GridSearch found the optimal hyperParameters for SVM training, C+=0.01, and C-/C+=3.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training Done!
Final feature weights
lnrSp	-0.000202322
deltLCn	0
deltCn	0.000274028
lnExpect	-0.000446514
Xcorr	-0.000594217
Sp	0.00147385
IonFrac	-0.000439271
Mass	-0.000243748
PepLen	0.00010638
Charge1	0
Charge2	-3.98033e-05
Charge3	3.98033e-05
enzN	5.83145e-05
enzC	2.70361e-05
enzInt	9.31679e-05
lnNumSP	-0.000228027
dM	-3.20245e-05
absdM	-1.73944e-05
Intercept	-0.995624
Test set evaluation finds 0 peptides under a FDR of 0.01
Processing took 3.9999 cpu seconds or 3 seconds wall clock time.
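Since the crash is intermittent, it may help to quantify how often it happens. The following is a rough sketch, not part of Percolator, that runs a command repeatedly and tallies how each run ended. It relies on the POSIX behaviour of Python's `subprocess`, where a process killed by a signal has a negative return code. The helper names and the stand-in demo command are assumptions; substitute the actual Percolator command line.

```python
import signal
import subprocess
import sys
from collections import Counter


def describe_exit(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, death by signal N is reported as return code -N.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exit code {returncode}"


def tally_runs(cmd, runs):
    """Run cmd the given number of times and count the outcomes."""
    outcomes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[describe_exit(result.returncode)] += 1
    return outcomes


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. Replace it with the
    # real command as a list, e.g. ["./percolator-3.07", "--testFDR", ...].
    demo_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    print(tally_runs(demo_cmd, 3))
```

A segmentation fault would show up in the tally as `killed by SIGSEGV`, which gives a crash rate to report alongside the logs.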

The reproducibility issue also seems to apply to the RESET implementation in v3.07 in general. The following outputs are from three runs on a file with more true positives, processed with identical parameters each time. The optimal hyperparameters, the final weights, and the number of identified peptides differ from run to run.

GridSearch found the optimal hyperParameters for SVM training, C+=10, and C-/C+=1.
Training with a set of size 16048 preptides, where of 911 are target, and 15137 are decoys.
Training with a set of size 16717 preptides, where of 1580 are target, and 15137 are decoys.
Training with a set of size 16995 preptides, where of 1858 are target, and 15137 are decoys.
Training with a set of size 17053 preptides, where of 1916 are target, and 15137 are decoys.
Training with a set of size 17131 preptides, where of 1994 are target, and 15137 are decoys.
Training Done!
Final feature weights
lnrSp	-2.67283
deltLCn	0
deltCn	0.467299
lnExpect	-0.650616
Xcorr	0.36079
Sp	-0.147871
IonFrac	0.827475
Mass	0.387092
PepLen	-0.109475
Charge1	0
Charge2	-0.205578
Charge3	0.205578
enzN	0.132966
enzC	-0.423166
enzInt	-0.0691106
lnNumSP	0.928027
dM	-1.10335
absdM	-0.78198
Intercept	-10.2233
Test set evaluation finds 2231 peptides under a FDR of 0.01
Processing took 4.3142 cpu seconds or 3 seconds wall clock time.

GridSearch found the optimal hyperParameters for SVM training, C+=10, and C-/C+=1.
Training with a set of size 16051 preptides, where of 915 are target, and 15136 are decoys.
Training with a set of size 16778 preptides, where of 1642 are target, and 15136 are decoys.
Training with a set of size 17073 preptides, where of 1937 are target, and 15136 are decoys.
Training with a set of size 17116 preptides, where of 1980 are target, and 15136 are decoys.
Training with a set of size 17185 preptides, where of 2049 are target, and 15136 are decoys.
Training Done!
Final feature weights
lnrSp	-2.64506
deltLCn	0
deltCn	0.488246
lnExpect	-0.517343
Xcorr	0.397673
Sp	-0.149412
IonFrac	0.747181
Mass	0.373851
PepLen	-0.0860658
Charge1	0
Charge2	-0.143679
Charge3	0.143679
enzN	0.130385
enzC	-0.434576
enzInt	-0.0295881
lnNumSP	0.799962
dM	-1.08981
absdM	-0.839981
Intercept	-9.7631
Test set evaluation finds 2254 peptides under a FDR of 0.01
Processing took 3.3856 cpu seconds or 2 seconds wall clock time.

GridSearch found the optimal hyperParameters for SVM training, C+=0.1, and C-/C+=0.1.
Training with a set of size 16034 preptides, where of 915 are target, and 15119 are decoys.
Training with a set of size 16121 preptides, where of 1002 are target, and 15119 are decoys.
Training with a set of size 16177 preptides, where of 1058 are target, and 15119 are decoys.
Training with a set of size 16186 preptides, where of 1067 are target, and 15119 are decoys.
Training with a set of size 16195 preptides, where of 1076 are target, and 15119 are decoys.
Training Done!
Final feature weights
lnrSp	0.0894803
deltLCn	0
deltCn	0.249886
lnExpect	-0.266886
Xcorr	-0.0604473
Sp	0.132344
IonFrac	0.0899176
Mass	0.0695621
PepLen	-0.0352744
Charge1	0
Charge2	-0.0242186
Charge3	0.0242186
enzN	0.00495253
enzC	-0.023019
enzInt	0.0114291
lnNumSP	0.0363138
dM	-0.0629611
absdM	-0.0594788
Intercept	-1.3715
Test set evaluation finds 1835 peptides under a FDR of 0.01
Processing took 3.7278 cpu seconds or 3 seconds wall clock time.
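To compare runs like the three above without reading the weights by eye, a small parser can pull out the final weights and the peptide count from each log. This is a sketch that assumes the log layout shown here (a `Final feature weights` line, then name/value pairs, then a `Test set evaluation finds N peptides` line). The function names are hypothetical.

```python
def parse_run(log_text):
    """Extract (weights, peptide_count) from one Percolator console log."""
    weights = {}
    peptides = None
    in_weights = False
    for line in log_text.splitlines():
        if line.startswith("Final feature weights"):
            in_weights = True
            continue
        if line.startswith("Test set evaluation finds"):
            # "Test set evaluation finds 2231 peptides ..." -> fifth token
            peptides = int(line.split()[4])
            in_weights = False
            continue
        if in_weights:
            parts = line.split()
            if len(parts) == 2:
                weights[parts[0]] = float(parts[1])
    return weights, peptides


def largest_weight_difference(a, b):
    """Return (feature, abs difference) for the most divergent shared weight."""
    shared = a.keys() & b.keys()
    if not shared:
        return None
    name = max(shared, key=lambda k: abs(a[k] - b[k]))
    return name, abs(a[name] - b[name])


if __name__ == "__main__":
    # Shortened excerpts of the first two runs quoted in this issue.
    run1 = "Final feature weights\nlnrSp\t-2.67283\nIntercept\t-10.2233\n" \
           "Test set evaluation finds 2231 peptides under a FDR of 0.01\n"
    run2 = "Final feature weights\nlnrSp\t-2.64506\nIntercept\t-9.7631\n" \
           "Test set evaluation finds 2254 peptides under a FDR of 0.01\n"
    w1, n1 = parse_run(run1)
    w2, n2 = parse_run(run2)
    print(n1, n2, largest_weight_difference(w1, w2))
```

If the runs were deterministic, the peptide counts would match and the largest difference would be zero.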

Also, not very critical, but I just noticed a misspelling of "peptides" when looking at the above output:

Training with a set of size 16186 preptides, where of 1067 are target, and 15119 are decoys.

Best,
Kevin

Owner
percolator commented Jun 18, 2024 via email

Author
kevinkovalchik commented Jun 20, 2024

I'm using the portable Ubuntu distribution from the 3.07 release. I think the wrong version is being reported, because the release date of 3.06 was a while ago, but this says May of 2024:
image

And it looks like src/Version.h defines the version as 3.06:
image

Thanks for the clarification on the seed. The command line help indicates that there is a default value for --seed:
image

Is this not the case? Or is it just not the case when running percolator-reset?

ds2268 commented Jun 20, 2024

Why does the output report "Percolator version 3.06.0, Build Date May 31 2024 11:32:51"?

Shouldn't it be 3.07?

Owner
percolator commented Jun 20, 2024 via email

@kevinkovalchik
Author

I'd just like to follow up on this. I'm glad the version mix-up got fixed, but do you have any advice on the segmentation fault? To clarify again, the error messages above are from v3.07. I am guessing the new algorithm fails when there are not enough positive examples, but of course it is hard to tell since all it does is throw a seg fault.

And can you clarify when the seed is and is not set? The documentation indicates the default value is 1, so I would not have expected that I need to set this in order for the RESET algorithm to be reproducible:

image
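One way to test whether the seed makes a difference is to pass `--seed` explicitly and check whether repeated runs produce byte-identical results files. The sketch below builds the command from the flags quoted earlier in this issue, plus `--seed`, and hashes each output. Whether an explicit seed actually makes RESET deterministic is exactly the open question here, so treat this as a test harness rather than a fix. The helper names are hypothetical.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def percolator_cmd(binary, pin_path, out_path, seed):
    """Command line from this issue with an explicit --seed added."""
    return [
        binary,
        "--testFDR", "0.01",
        "--trainFDR", "0.01",
        "--results-psms", str(out_path),
        "--post-processing-tdc",
        "--reset-algorithm",
        "--only-psms",
        "--seed", str(seed),
        str(pin_path),
    ]


def sha256_of(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def digests_for_repeated_runs(make_cmd, runs):
    """Run make_cmd(out_path) several times and hash each output file.

    check=True means a crash (such as a segfault) raises
    subprocess.CalledProcessError instead of being silently ignored.
    """
    digests = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(runs):
            out = Path(tmp) / f"run_{i}.psms.tsv"
            subprocess.run(make_cmd(out), check=True)
            digests.append(sha256_of(out))
    return digests


if __name__ == "__main__":
    # Stand-in command that writes a fixed file, so the sketch runs anywhere.
    script = "import sys; open(sys.argv[1], 'w').write('same')"
    fake = lambda out: [sys.executable, "-c", script, str(out)]
    print(len(set(digests_for_repeated_runs(fake, 2))) == 1)
```

With the real binary, something like `digests_for_repeated_runs(lambda out: percolator_cmd("./percolator-3.07", pin, out, 1), 5)` would show whether all five outputs match.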
