To evaluate imputation performance, you will need:
- A VCF file containing imputed data for each sample.
- A corresponding high-coverage VCF file for the same sample.
⚠️ Important: Sample names must match exactly in the headers of both files.
We worked with one VCF per individual.
For each sample, run:
SnpSift concordance SAMPLE_NAME_HIGH.COVERAGE.vcf SAMPLE_NAME_IMPUTED.vcf > comparison_SAMPLE_NAME.txt
This will generate three output files:
comparison_SAMPLE_NAME.txt
: Concordance by variant*.by_sample.txt
: Concordance by sample*.summary.txt
: Summary
Run the following script on the *.by_sample.txt
file:
perl SnpSiftStats.pl *.by_sample.txt
This will generate, for each sample, a summary file like SAMPLE_NAME.matrix
, showing genotype concordance between imputed and high-coverage data:
Sample ID: SAMPLE_NAME
h/i 0/0 0/1 1/1 ./.
0/0 31572015 7329 10 1160372
0/1 22344 1121895 2636 738005
1/1 206 9016 786197 344940
./. 0 0 0 0
This matrix reports the number of concordant and discordant genotype calls. For example:
10
sites were expected as0/0
in the high-coverage data but imputed as1/1
→ discordant.1,121,895
sites were correctly imputed as0/1
.
In this example, no missing sites are present. However, missingness might appear if, for example, the imputed file is filtered based on genotype probability (GP), resulting in missing calls (./.
).
We recommend repeating this step after different post-imputation filtering strategies to compare their impact.
Run:
perl addfreq_summarybin.pl comparison_SAMPLE_NAME.txt freqfile
You’ll need:
- The
comparison_SAMPLE_NAME.txt
file from step 1. - A frequency file for each variant, computed with PLINK on the reference panel used for imputation (calculate per chromosome, then merge into a single file).
Example format of the frequency file:
CHR SNP A1 A2 MAF NCHROBS
1 1_13380_C_G G C 0.0001102 45382
1 1_16071_G_A A G 0.0001763 45382
1 1_16141_C_T T C 0.0001542 45382
This step:
- Adds a frequency column to each variant.
- Splits the variants into 6 frequency bins:
Output File | Frequency Range |
---|---|
comparison.with.freq_lt0.001 | freq ≤ 0.001 |
comparison.with.freq_0.001-0.01 | 0.001 < freq ≤ 0.01 |
comparison.with.freq_0.01-0.05 | 0.01 < freq ≤ 0.05 |
comparison.with.freq_0.05-0.1 | 0.05 < freq ≤ 0.1 |
comparison.with.freq_0.1-0.3 | 0.1 < freq ≤ 0.3 |
comparison.with.freq_gt0.3 | freq > 0.3 |
Additionally, a cumulative file is generated for all variants with freq > 0.05:
comparison.with.freq_gt0.05
→ union of bins 0.05–0.1, 0.1–0.3, and >0.3
Each bin also has a corresponding summary file:
e.g. freq_0.001-0.01_summary
, which aggregates concordance statistics for that bin.
Run:
for i in *_summary; do perl SnpSiftStats.pl $i; done
Each bin will now have a .matrix
file. For example, gt0.05.matrix
:
Sample ID: gt0.05
h/i 0/0 0/1 1/1 ./.
0/0 2066128 7033 5 674900
0/1 10655 1070164 2459 651781
1/1 8 8933 580247 333651
./. 0 0 0 0
These matrices summarize concordance/discordance between imputed and high-coverage genotypes within specific frequency ranges.
Use the .matrix
files generated in Step 4 as input:
for m in *.matrix; do perl PostImputationStats.pl $m; done
This will generate a summary file for each frequency bin, containing a set of statistics on concordance and accuracy of imputed genotypes.
Column | Description |
---|---|
MAF.bin |
Minor Allele Frequency bin name |
#Het |
Correctly imputed heterozygous sites (count and percentage) |
#Ref |
Correctly imputed homozygous reference sites |
#Alt |
Correctly imputed homozygous alternate sites |
Total |
Total number of correctly imputed sites (Het + Ref + Alt) |
MisHet(%) |
Percentage of missing heterozygous genotypes |
MisRef(%) |
Percentage of missing homozygous reference genotypes |
MisAlt(%) |
Percentage of missing homozygous alternate genotypes |
MisTot(%) |
Total missing genotype percentage |
Het.Precision(%) |
Precision for heterozygous calls |
Ref.Precision(%) |
Precision for reference calls |
Alt.Precision(%) |
Precision for alternate calls |
Tot.Precision(%) |
Overall precision |
Het.Sensitivity(%) |
Sensitivity for heterozygous calls |
Ref.Sensitivity(%) |
Sensitivity for reference calls |
Alt.Sensitivity(%) |
Sensitivity for alternate calls |
Tot.Sensitivity(%) |
Overall sensitivity |
Het.Specificity(%) |
Specificity for heterozygous calls |
Ref.Specificity(%) |
Specificity for reference calls |
Alt.Specificity(%) |
Specificity for alternate calls |
Tot.Specificity(%) |
Overall specificity |
Het.Accuracy(%) |
Accuracy of heterozygous calls |
Ref.Accuracy(%) |
Accuracy of reference calls |
Alt.Accuracy(%) |
Accuracy of alternate calls |
Tot.Accuracy(%) |
Overall accuracy |
Het.FPR(%) |
False Positive Rate for heterozygous calls |
Ref.FPR(%) |
False Positive Rate for reference calls |
Alt.FPR(%) |
False Positive Rate for alternate calls |
Tot.FPR(%) |
Overall False Positive Rate |
Het.FNR(%) |
False Negative Rate for heterozygous calls |
Ref.FNR(%) |
False Negative Rate for reference calls |
Alt.FNR(%) |
False Negative Rate for alternate calls |
Tot.FNR(%) |
Overall False Negative Rate |
Non-reference.discordance(%) |
Discordance rate for non-reference genotypes |
Non-reference.concordance(%) |
Concordance rate for non-reference genotypes (100 - discordance) |
Once the individual bin files have been generated, merge them into a single summary using:
bash makeSummary.sh
This will produce a single SummaryStats
file aggregating statistics across all MAF bins for the sample.
Before running this step, make sure you have BCFtools installed, and prepare a set of variant list files, one for each frequency bin. Name them as follows:
0_0.001
0.001_0.01
0.01_0.05
0.05_0.1
0.1_0.3
gt0.3
gt0.05
Each file should contain two columns: chromosome and position, extracted from the frequency file used in Step 3:
chr1 598812
chr1 758351
chr1 794299
Now run:
bash makeR2.sh path/to/folder/with/frequency_files/ SAMPLE_NAME_IMPUTED.vcf SAMPLE_NAME_HIGH.COVERAGE.vcf
- Extract variants for each MAF bin from the imputed VCF.
- Create compressed and indexed VCFs for each bin.
- Compare each bin to the high-coverage reference VCF.
- Compute Pearson correlation (R²) between imputed and true genotypes for each bin.
- Append the R² values to the corresponding summary.
Final output will be a file named SummaryStatsR2
, which is identical to the SummaryStats
file from Step 6, but includes an additional column with R² values for each frequency bin.