RESET algorithm segmentation fault and reproducibility

Hello,

I did some testing of v3.07 and am running into occasional segmentation faults when the RESET algorithm is enabled (--reset-algorithm). They seem to be associated with PIN files that have low target/decoy ratios. Unfortunately, I cannot share the PIN files.

The crash is also not deterministic: processing the same file twice with the same parameters can succeed one time and crash the next. The following are the console outputs from such a case.

Segmentation fault:

Percolator version 3.06.0, Build Date May 31 2024 11:32:51
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the
Department of Genome Sciences at the University of Washington.
Issued command:
./percolator-3.07 --testFDR 0.01 --trainFDR 0.01 --results-psms output/percolator_results/test_file.RESET.FDR=0.01.psms.tsv --post-processing-tdc --reset-algorithm --only-psms output/comet_results/test_file.pin
Started Mon Jun 17 11:20:21 2024
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Finding protein decoy prefix for output/comet_results/test_file.pin
Using protein decoy prefix "rev_"
Reading tab-delimited input from datafile output/comet_results/test_file.pin
Features:
lnrSp deltLCn deltCn lnExpect Xcorr Sp IonFrac Mass PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumSP dM absdM 
Found 36814 PSMs
Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
Train/test set contains 19051 positives and 17763 negatives, size ratio=1.07251 and pi0=1
Running the Percolator-RESET algorithm.
Selecting best separating single variable.
No weights file given, using default direction
Selected feature 4 as initial direction. Could separate 230 training set positives with q<0.01 in that direction.
Starting reset: psmsOnly
Selected the best scoring PSM for each of the 32273 peptides from a dataset of 36814 PSMs.
Splitting into train/test
Setting up SVM training for a size of 32273 peptides.
Training set prepared. Starting SVM Training.
02-run_percolator.sh: line 4: 2455196 Segmentation fault      (core dumped)

No segmentation fault:

Percolator version 3.06.0, Build Date May 31 2024 11:32:51
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the
Department of Genome Sciences at the University of Washington.
Issued command:
./percolator-3.07 --testFDR 0.01 --trainFDR 0.01 --results-psms output/percolator_results/test_file.RESET.FDR=0.01.psms.tsv --post-processing-tdc --reset-algorithm --only-psms output/comet_results/test_file.pin
Started Mon Jun 17 11:16:35 2024
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Finding protein decoy prefix for output/comet_results/test_file.pin
Using protein decoy prefix "rev_"
Reading tab-delimited input from datafile output/comet_results/test_file.pin
Features:
lnrSp deltLCn deltCn lnExpect Xcorr Sp IonFrac Mass PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumSP dM absdM 
Found 36814 PSMs
Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
Train/test set contains 19051 positives and 17763 negatives, size ratio=1.07251 and pi0=1
Running the Percolator-RESET algorithm.
Selecting best separating single variable.
No weights file given, using default direction
Selected feature 4 as initial direction. Could separate 230 training set positives with q<0.01 in that direction.
Starting reset: psmsOnly
Selected the best scoring PSM for each of the 32273 peptides from a dataset of 36814 PSMs.
Splitting into train/test
Setting up SVM training for a size of 32273 peptides.
Training set prepared. Starting SVM Training.
GridSearch found the optimal hyperParameters for SVM training, C+=0.01, and C-/C+=3.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training with a set of size 7873 preptides, where of 0 are target, and 7873 are decoys.
Training Done!
Final feature weights
lnrSp	-0.000202322
deltLCn	0
deltCn	0.000274028
lnExpect	-0.000446514
Xcorr	-0.000594217
Sp	0.00147385
IonFrac	-0.000439271
Mass	-0.000243748
PepLen	0.00010638
Charge1	0
Charge2	-3.98033e-05
Charge3	3.98033e-05
enzN	5.83145e-05
enzC	2.70361e-05
enzInt	9.31679e-05
lnNumSP	-0.000228027
dM	-3.20245e-05
absdM	-1.73944e-05
Intercept	-0.995624
Test set evaluation finds 0 peptides under a FDR of 0.01
Processing took 3.9999 cpu seconds or 3 seconds wall clock time.
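
To put a number on how often the crash happens, the same command could be run repeatedly while tallying exit codes. This is only a minimal sketch: the helper name `tally_exit_codes` and the constant `PERCOLATOR_CMD` are my own, the command is the one from the logs above, and it assumes a POSIX system, where Python reports a process killed by a signal as a negative return code (SIGSEGV would show up as -11).

```python
import subprocess
from collections import Counter

# Command copied from the console output above; adjust paths as needed.
PERCOLATOR_CMD = [
    "./percolator-3.07", "--testFDR", "0.01", "--trainFDR", "0.01",
    "--results-psms",
    "output/percolator_results/test_file.RESET.FDR=0.01.psms.tsv",
    "--post-processing-tdc", "--reset-algorithm", "--only-psms",
    "output/comet_results/test_file.pin",
]


def tally_exit_codes(cmd, runs=10):
    """Run cmd `runs` times and count how often each exit code occurs."""
    codes = Counter()
    for _ in range(runs):
        # Output is discarded; only the exit status matters here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes[result.returncode] += 1
    return codes


# Example usage (not executed here):
# print(tally_exit_codes(PERCOLATOR_CMD, runs=20))
```

A tally containing both 0 and a negative code for identical inputs would confirm that the failure is intermittent rather than tied to the parameters.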

The lack of reproducibility also seems to apply to the RESET implementation in v3.07 in general. The following outputs are from a file with more true positives, processed three times with identical parameters. The optimal hyperparameters, final weights, and number of identified peptides differ between runs.

GridSearch found the optimal hyperParameters for SVM training, C+=10, and C-/C+=1.
Training with a set of size 16048 preptides, where of 911 are target, and 15137 are decoys.
Training with a set of size 16717 preptides, where of 1580 are target, and 15137 are decoys.
Training with a set of size 16995 preptides, where of 1858 are target, and 15137 are decoys.
Training with a set of size 17053 preptides, where of 1916 are target, and 15137 are decoys.
Training with a set of size 17131 preptides, where of 1994 are target, and 15137 are decoys.
Training Done!
Final feature weights
lnrSp	-2.67283
deltLCn	0
deltCn	0.467299
lnExpect	-0.650616
Xcorr	0.36079
Sp	-0.147871
IonFrac	0.827475
Mass	0.387092
PepLen	-0.109475
Charge1	0
Charge2	-0.205578
Charge3	0.205578
enzN	0.132966
enzC	-0.423166
enzInt	-0.0691106
lnNumSP	0.928027
dM	-1.10335
absdM	-0.78198
Intercept	-10.2233
Test set evaluation finds 2231 peptides under a FDR of 0.01
Processing took 4.3142 cpu seconds or 3 seconds wall clock time.

GridSearch found the optimal hyperParameters for SVM training, C+=10, and C-/C+=1.
Training with a set of size 16051 preptides, where of 915 are target, and 15136 are decoys.
Training with a set of size 16778 preptides, where of 1642 are target, and 15136 are decoys.
Training with a set of size 17073 preptides, where of 1937 are target, and 15136 are decoys.
Training with a set of size 17116 preptides, where of 1980 are target, and 15136 are decoys.
Training with a set of size 17185 preptides, where of 2049 are target, and 15136 are decoys.
Training Done!
Final feature weights
lnrSp	-2.64506
deltLCn	0
deltCn	0.488246
lnExpect	-0.517343
Xcorr	0.397673
Sp	-0.149412
IonFrac	0.747181
Mass	0.373851
PepLen	-0.0860658
Charge1	0
Charge2	-0.143679
Charge3	0.143679
enzN	0.130385
enzC	-0.434576
enzInt	-0.0295881
lnNumSP	0.799962
dM	-1.08981
absdM	-0.839981
Intercept	-9.7631
Test set evaluation finds 2254 peptides under a FDR of 0.01
Processing took 3.3856 cpu seconds or 2 seconds wall clock time.

GridSearch found the optimal hyperParameters for SVM training, C+=0.1, and C-/C+=0.1.
Training with a set of size 16034 preptides, where of 915 are target, and 15119 are decoys.
Training with a set of size 16121 preptides, where of 1002 are target, and 15119 are decoys.
Training with a set of size 16177 preptides, where of 1058 are target, and 15119 are decoys.
Training with a set of size 16186 preptides, where of 1067 are target, and 15119 are decoys.
Training with a set of size 16195 preptides, where of 1076 are target, and 15119 are decoys.
Training Done!
Final feature weights
lnrSp	0.0894803
deltLCn	0
deltCn	0.249886
lnExpect	-0.266886
Xcorr	-0.0604473
Sp	0.132344
IonFrac	0.0899176
Mass	0.0695621
PepLen	-0.0352744
Charge1	0
Charge2	-0.0242186
Charge3	0.0242186
enzN	0.00495253
enzC	-0.023019
enzInt	0.0114291
lnNumSP	0.0363138
dM	-0.0629611
absdM	-0.0594788
Intercept	-1.3715
Test set evaluation finds 1835 peptides under a FDR of 0.01
Processing took 3.7278 cpu seconds or 3 seconds wall clock time.
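
For comparing runs like the three above without reading the weights by eye, the "Final feature weights" block can be parsed from saved console output. This is a rough sketch, not part of Percolator: the function names are hypothetical, and it assumes each weight line has exactly two whitespace-separated fields (feature name and value), as in the logs above.

```python
def parse_final_weights(log_text):
    """Return {feature: weight} from the 'Final feature weights' block of one log."""
    weights = {}
    in_block = False
    for line in log_text.splitlines():
        if line.startswith("Final feature weights"):
            in_block = True
            continue
        if in_block:
            parts = line.split()
            # The block ends at the first line that is not "<name> <number>".
            if len(parts) != 2:
                break
            try:
                weights[parts[0]] = float(parts[1])
            except ValueError:
                break
    return weights


def max_weight_difference(weights_a, weights_b):
    """Largest absolute per-feature difference between two runs."""
    shared = weights_a.keys() & weights_b.keys()
    return max((abs(weights_a[k] - weights_b[k]) for k in shared), default=0.0)
```

If the runs were deterministic, `max_weight_difference` would be 0.0 for any pair of logs produced from the same input and parameters.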

Also, not critical, but I noticed that "peptides" is misspelled as "preptides" in the above output:

Training with a set of size 16186 preptides, where of 1067 are target, and 15119 are decoys.

Best,
Kevin
