Verifying status of "uniqueness" of alignments for single cell analysis · Issue #54 · pachterlab/kallistobustools · GitHub
Verifying status of "uniqueness" of alignments for single cell analysis #54
Open
@sknaack

Description

I have a question regarding the "uniqueness" of pseudoalignments and how this is handled in Kallisto/Bustools. I've prepared a mouse GRCm39 reference from the ENSEMBL R109 mouse transcriptome and genome .fa files, as well as the .gtf file, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME_ANNOTATION>". This produced a working index for GRCm39, which I've used successfully on a set of single-cell data. My questions are as follows:

  1. My results show very low percentages of uniquely pseudoaligned reads, only 12-33% per sample across 6 samples. How does Kallisto address non-uniquely mapped reads? Are they simply not included in the output count matrix? I'm concerned a substantial amount of data is being thrown out because of this. I've copied an example run_info.json and inspect.json file below.

cat run_info.json
{
"n_targets": 219393688,
"n_bootstraps": 0,
"n_processed": 340480788,
"n_pseudoaligned": 90125537,
"n_unique": 46079947,
"p_pseudoaligned": 26.5,
"p_unique": 13.5,
"kallisto_version": "0.48.0",
"index_version": -1293124848,
"start_time": "Sun Jul 2 21:08:02 2023",
"call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
}
cat inspect.json
{
"numRecords": 37168977,
"numReads": 92983952,
"numBarcodes": 1937400,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 47.994194,
"numUMIs": 12884443,
"numBarcodeUMIs": 33830912,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 17.462017,
"gtRecords": 11411219,
"numBarcodesOnWhitelist": 469183,
"percentageBarcodesOnWhitelist": 24.217147,
"numReadsOnWhitelist": 85704414,
"percentageReadsOnWhitelist": 92.171189
}

  2. Is the "p_unique" value reported in run_info.json as concerning as I suspect it is, or should it not be over-interpreted? Are there any alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?

  3. A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare an index that is gene-level only? I'm mainly interested in tabulation by gene at this point.
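
For reference, the percentages in the run_info.json above can be reproduced from the raw counts. The sketch below is only a sanity check in Python, and it assumes that both percentages are taken relative to n_processed, which is what matches the reported values. The helper name `percent` and the variable `p_unique_of_aligned` are my own and do not come from Kallisto. The last figure shows that roughly half of the reads that pseudoaligned at all were unique, which is a different picture from the 13.5% headline number.

```python
# Counts copied verbatim from the run_info.json shown above.
run_info = {
    "n_processed": 340480788,
    "n_pseudoaligned": 90125537,
    "n_unique": 46079947,
}


def percent(part, whole):
    """Return part as a percentage of whole (hypothetical helper)."""
    return 100.0 * part / whole


# Assumption: both reported percentages use n_processed as the denominator.
p_pseudoaligned = percent(run_info["n_pseudoaligned"], run_info["n_processed"])
p_unique = percent(run_info["n_unique"], run_info["n_processed"])

# Not reported by Kallisto: unique reads as a share of pseudoaligned reads only.
p_unique_of_aligned = percent(run_info["n_unique"], run_info["n_pseudoaligned"])

print(f"p_pseudoaligned      = {p_pseudoaligned:.1f}")
print(f"p_unique             = {p_unique:.1f}")
print(f"unique / pseudoaligned = {p_unique_of_aligned:.1f}")
```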

Thank you in advance for any input and advice! I like Kallisto/Bustools a lot, and am finding it easy to use, but need to ensure I'm applying it intelligently to this single cell data.

Sara Knaack
