Bayesian Inference and Distance Calculation for Single-Cell RNA-seq Data
SanityR provides an R interface to the Sanity model for single-cell gene expression analysis, described in Breda et al. (2021), Nature Biotechnology. It offers tools for:
- Bayesian estimation of log-normalized counts and their uncertainty.
- Computing statistically sound distances between cells while accounting for uncertainty.
- Integration with SingleCellExperiment, so that it can be used as part of the Bioconductor single-cell workflow.
remotes::install_github("TeoSakel/SanityR")
library(SanityR)
# Simulate data
sce <- simulate_branched_random_walk(N_path = 10, length_path = 10, N_gene = 200)
# Run Sanity estimation
sce <- Sanity(sce)
# Compute distances
dist <- calculateSanityDistance(sce)
# Perform clustering or visualization
plot(hclust(dist))
sce <- Sanity(sce)
logcounts(sce)
Log-normalizes the UMI counts and estimates error bars for each value using a hierarchical Bayesian model:
where:
- $n_{gc}$ is the observed UMI count for gene $g$ in cell $c$.
- $\lambda_c$ is the cell-specific transcription rate.
- $\alpha_g$ is the mean activity quotient of gene $g$.
- $a$ and $b$ are prior hyperparameters for the Gamma distribution.
- $\delta_{gc}$ is the log fold-change of activity for gene $g$ in cell $c$ versus the mean.
- $v_g$ is the prior variance of the log fold-change for gene $g$.
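The symbol definitions above are consistent with a hierarchical model of the following form. This is a reconstruction that assumes a Poisson likelihood for the counts, a Gamma prior on $\alpha_g$, and a zero-mean Normal prior on $\delta_{gc}$; the exact parameterization used by the package may differ.

$$
\begin{aligned}
n_{gc} &\sim \mathrm{Poisson}\left(\lambda_c \, \alpha_g \, e^{\delta_{gc}}\right) \\
\alpha_g &\sim \mathrm{Gamma}(a, b) \\
\delta_{gc} &\sim \mathrm{Normal}(0, v_g)
\end{aligned}
$$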
Log-normalized counts in this model are calculated as:
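Assuming a multiplicative form in which the activity of gene $g$ in cell $c$ is $\alpha_g e^{\delta_{gc}}$, as the definitions above suggest, taking the logarithm gives the expression below. This is a reconstruction rather than a quoted formula, and $y_{gc}$ is a label introduced here for illustration only; in practice the reported values would be posterior estimates of these quantities.

$$
y_{gc} = \log(\alpha_g) + \delta_{gc}
$$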
dist <- calculateSanityDistance(sce)
Computes the expected Euclidean distance between cells, accounting for measurement uncertainty:
where:
- $d$ is the distance between cells $c$ and $c'$.
- $\delta_{gc}$ is the log fold-change of activity for gene $g$ in cell $c$, computed by Sanity.
- $\eta_g = \epsilon_{gc} + \epsilon_{gc'}$ is the sum of the posterior variances of $\delta_{gc}$ and $\delta_{gc'}$.
- $\Delta_g$ is the “true” distance along the dimension of gene $g$.
- $\alpha$ is a hyperparameter that controls the correlation between cells (0 = fully correlated, 2 = fully independent).
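The quantities defined above can be combined in a small base R sketch. It assumes a Gaussian prior on the true per-gene difference, $\Delta_g \sim \mathrm{Normal}(0, \alpha v_g)$, and Gaussian measurement noise with variance $\eta_g$ on the observed difference $\delta_{gc} - \delta_{gc'}$. Under those assumptions, standard conjugacy gives a posterior mean and variance for $\Delta_g$, and the expected squared distance is the sum over genes of the squared posterior mean plus the posterior variance. The function name expected_sq_distance and its arguments are hypothetical; this is not the implementation behind calculateSanityDistance().

```r
# Expected squared distance between two cells under a Gaussian sketch.
# Assumptions (not taken from the package source):
#   prior:      Delta_g ~ Normal(0, alpha * v_g)
#   likelihood: x_g | Delta_g ~ Normal(Delta_g, eta_g),
#               with x_g = delta_gc - delta_gc'
expected_sq_distance <- function(delta_c, delta_c2, eps_c, eps_c2, v, alpha = 2) {
  stopifnot(length(delta_c) == length(delta_c2),
            length(delta_c) == length(v),
            all(eps_c > 0), all(eps_c2 > 0),
            alpha >= 0, alpha <= 2)
  x         <- delta_c - delta_c2           # observed difference per gene
  eta       <- eps_c + eps_c2               # summed posterior variances
  prior_var <- alpha * v                    # prior variance of the true difference
  shrink    <- prior_var / (prior_var + eta) # shrinkage factor in [0, 1]
  post_mean <- shrink * x                   # posterior mean of Delta_g
  post_var  <- shrink * eta                 # posterior variance of Delta_g
  sum(post_mean^2 + post_var)               # E[Delta_g^2], summed over genes
}

# One gene, equal prior and noise variance: the difference is shrunk by half
d2 <- expected_sq_distance(delta_c = 1, delta_c2 = 0,
                           eps_c = 0.5, eps_c2 = 0.5,
                           v = 0.5, alpha = 2)
sqrt(d2)
```

With alpha = 0 the prior forces the true difference to zero, so the distance collapses to zero regardless of the observed values, which matches the "fully correlated" end of the range.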
The function requires Sanity() to have been run beforehand to estimate the log fold-changes and their posterior variances. It returns a dist object suitable for clustering or embedding.
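A dist object can be passed directly to base R clustering and embedding tools. The sketch below builds a stand-in distance object with stats::dist() on random data so that it runs without the package; in practice the object returned by calculateSanityDistance() would take its place. The linkage method, the number of clusters, and the number of embedding dimensions are arbitrary choices made for illustration.

```r
set.seed(1)

# Stand-in for the output of calculateSanityDistance():
# 30 "cells" described by 5 random features
toy <- matrix(rnorm(30 * 5), nrow = 30)
d <- dist(toy)

# Hierarchical clustering, cut into 3 groups (arbitrary choice)
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 3)

# Classical multidimensional scaling to 2 dimensions for plotting
embedding <- cmdscale(d, k = 2)
plot(embedding, col = clusters, pch = 19, xlab = "MDS 1", ylab = "MDS 2")
```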
Provides two functions to generate synthetic datasets for benchmarking, using the generative processes described in the original paper:
- simulate_independent_cells(): Simulates cells with independent gene expression profiles.
- simulate_branched_random_walk(): Simulates cells with pseudo-temporal trajectories forming a tree.
sce_indep <- simulate_independent_cells(N_cell = 100, N_gene = 50)
sce_branch <- simulate_branched_random_walk(N_path = 20, length_path = 5, N_gene = 50)
Both functions return a SingleCellExperiment object.
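To make the idea of simulating independent cells concrete, here is a toy base R sketch that draws a count matrix from a Poisson model with Gamma-distributed gene means and Normal log fold-changes. It is an illustration under assumed parameter values, not the procedure implemented by simulate_independent_cells(), and it returns a plain matrix rather than a SingleCellExperiment. The function name toy_independent_counts and all of its defaults are hypothetical.

```r
# Toy generator: every gene/cell pair gets an independent log fold-change.
# All parameter defaults are assumptions chosen for illustration.
toy_independent_counts <- function(n_cell = 100, n_gene = 50,
                                   a = 1, b = 1, v = 0.5, depth = 1000) {
  # Gene-level mean activity, normalized to sum to 1 across genes
  alpha <- rgamma(n_gene, shape = a, rate = b)
  alpha <- alpha / sum(alpha)

  # Independent log fold-changes with variance v
  delta <- matrix(rnorm(n_gene * n_cell, mean = 0, sd = sqrt(v)),
                  nrow = n_gene, ncol = n_cell)

  # Cell-specific scaling (here the same depth for every cell)
  lambda <- rep(depth, n_cell)

  # Poisson rate per gene (rows) and cell (columns)
  rate <- sweep(alpha * exp(delta), 2, lambda, `*`)

  matrix(rpois(length(rate), rate), nrow = n_gene, ncol = n_cell,
         dimnames = list(paste0("gene", seq_len(n_gene)),
                         paste0("cell", seq_len(n_cell))))
}

set.seed(42)
counts <- toy_independent_counts(n_cell = 100, n_gene = 50)
dim(counts)
```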
Breda, J., Zavolan, M. & van Nimwegen, E. Bayesian inference of gene expression states from single-cell RNA-seq data. Nature Biotechnology 39, 1008–1016 (2021). doi:10.1038/s41587-021-00875-x
Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nature Methods 17, 137–145 (2020). doi:10.1038/s41592-019-0654-x