Description
I have a question regarding the "uniqueness" of pseudoalignments and how this is handled in Kallisto/Bustools. I've prepared a mouse GRCm39 transcriptome reference from the ENSEMBL R109 mouse transcriptome and genome .fa files, as well as the .gtf file, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME_ANNOTATION>". This produced a working index for GRCm39, which I've used successfully on a set of single-cell data. My questions are as follows:
- Very low percentages of uniquely pseudoaligned reads are reported in my results: only 12-33% per sample across 6 samples. How does Kallisto handle non-uniquely mapped reads? Are they simply excluded from the output count matrix? I'm concerned that a substantial amount of data is being discarded because of this. I've copied an example run_info.json and inspect.json file below.
cat run_info.json
{
"n_targets": 219393688,
"n_bootstraps": 0,
"n_processed": 340480788,
"n_pseudoaligned": 90125537,
"n_unique": 46079947,
"p_pseudoaligned": 26.5,
"p_unique": 13.5,
"kallisto_version": "0.48.0",
"index_version": -1293124848,
"start_time": "Sun Jul 2 21:08:02 2023",
"call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
}
cat inspect.json
{
"numRecords": 37168977,
"numReads": 92983952,
"numBarcodes": 1937400,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 47.994194,
"numUMIs": 12884443,
"numBarcodeUMIs": 33830912,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 17.462017,
"gtRecords": 11411219,
"numBarcodesOnWhitelist": 469183,
"percentageBarcodesOnWhitelist": 24.217147,
"numReadsOnWhitelist": 85704414,
"percentageReadsOnWhitelist": 92.171189
}
- Is the "p_unique" value reported in run_info.json as concerning as I suspect it is, or should it not be over-interpreted? Are there any alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?
- A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare an index at the gene level only? I'm mainly interested in tabulation by gene at this point.
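For context, the percentages in run_info.json appear to be computed relative to n_processed (all reads), not relative to the reads that pseudoaligned. Below is a minimal sketch that re-derives them from the counts pasted above and also expresses n_unique as a share of n_pseudoaligned. The helper name `pct` is only for illustration and is not part of Kallisto or Bustools.

```python
import json

# Counts copied from the run_info.json pasted above.
run_info = json.loads("""
{
  "n_processed": 340480788,
  "n_pseudoaligned": 90125537,
  "n_unique": 46079947
}
""")


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Percentages relative to all processed reads (these match run_info.json).
p_pseudoaligned = pct(run_info["n_pseudoaligned"], run_info["n_processed"])
p_unique = pct(run_info["n_unique"], run_info["n_processed"])

# Unique reads as a share of the reads that pseudoaligned at all.
p_unique_of_aligned = pct(run_info["n_unique"], run_info["n_pseudoaligned"])

print("p_pseudoaligned:", p_pseudoaligned)
print("p_unique:", p_unique)
print("unique among pseudoaligned:", p_unique_of_aligned)
```

By this arithmetic, roughly half of the reads that pseudoaligned in this sample did so uniquely, so the low p_unique seems to be driven at least as much by the low overall pseudoalignment rate as by multi-mapping.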
Thank you in advance for any input and advice! I like Kallisto/Bustools a lot and am finding it easy to use, but I need to ensure I'm applying it sensibly to this single-cell data.
Sara Knaack