This repository contains the code to reproduce all results from our paper "Tree reconstruction guarantees from CRISPR-Cas9 lineage tracing data using Neighbor-Joining".
To reproduce all results, first create a Python environment and install all requirements. For instance:
$ conda create --name nj-theory-repro python=3.10
$ conda activate nj-theory-repro
$ pip install -r requirements.txt
If you have any issues setting up the environment, you can use pip_freeze.txt instead of requirements.txt.
Make sure the tests are passing:
$ pip install pytest
$ python -m pytest tests/
Then, you can just run:
$ time python -m casbench.papers.paper_nj_theory.figures
NOTE: To change the number of processes used to parallelize the computation, edit the line NUM_PROCESSES = 8 in figures.py.
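When picking a value for NUM_PROCESSES, it can help to cap it at the number of cores on the machine. The snippet below is a minimal sketch of that idea only; `suggest_num_processes` is a hypothetical helper and is not part of this repository.

```python
import os


def suggest_num_processes(requested=8):
    """Hypothetical helper: cap the requested process count at the core count.

    `requested` defaults to 8, the value of NUM_PROCESSES mentioned in the note above.
    """
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


print(suggest_num_processes())
```

The printed number is a reasonable starting point for NUM_PROCESSES; lower it if memory becomes the bottleneck.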
Each function call in figures.py reproduces one set of figures:
fig_kp() reproduces the results on the KP data, showing that distance correction obtains the best performance on the majority of clones. The results table will be written to the file nj_theory_figures/kp_table.tex.
run_simulated_data_benchmark() reproduces the simulated data benchmark results, showing the performance of each of the 4 models on the different lineage tracing regimes. The figures will be located at nj_theory_figures/simulated_data_benchmark/.
fig_consistency_experiment() reproduces the specific simulation results showing that as the number of lineage tracing characters increases, tree reconstructions become perfect. The figures will be located at nj_theory_figures/nj_theory_paper_consistency/.
fig_statistical_efficiency() reproduces the specific simulation results used to dissect the statistical efficiency of the distance correction approach, where we see that distance correction achieves similar performance with 10-15 percent fewer characters. The figures will be located at nj_theory_figures/nj_theory_paper_statistical_efficiency/.
fig_q_distribution() reproduces the figure showing the CRISPR/Cas9 indel state probabilities. It will be written to the file nj_theory_figures/fig_q_distribution.png.
fig_gt_trees() reproduces the figures showing the simulated trees. The trees will be located in the folder nj_theory_figures/trees/.
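After a full run, a quick way to confirm that every output listed above was produced is to check the paths directly. The following is a minimal sketch, not part of the repository; `missing_outputs` is a hypothetical helper, and it assumes the nj_theory_figures paths are relative to the directory from which the figures module was run.

```python
from pathlib import Path

# Output locations named in the list above.
# Assumption: they are relative to the working directory of the run.
EXPECTED_OUTPUTS = [
    "nj_theory_figures/kp_table.tex",
    "nj_theory_figures/simulated_data_benchmark",
    "nj_theory_figures/nj_theory_paper_consistency",
    "nj_theory_figures/nj_theory_paper_statistical_efficiency",
    "nj_theory_figures/fig_q_distribution.png",
    "nj_theory_figures/trees",
]


def missing_outputs(root="."):
    """Return the expected outputs that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_OUTPUTS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected outputs are present.")
```

This only checks that the files and folders exist; it does not verify their contents.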
If you want to make sure that everything will run smoothly, we recommend first uncommenting and running the # FAST TEST VERSIONS block of code.
The codebase uses caching to make benchmarking faster and seamless. The cache directories are _cache_nj_theory (for the simulation results) and _cache_nj_theory_real_data (for the KP experiment results). Feel free to delete these cache directories to free up space after you are done reproducing our results.
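To see how much space the caches occupy before deleting them, a small script along the following lines can be used. This is a sketch under assumptions: `clear_caches` and `dir_size_bytes` are hypothetical helpers, not part of the codebase, and the cache directories are assumed to sit in the directory the script is run from.

```python
import shutil
from pathlib import Path

# Cache directory names taken from the paragraph above.
CACHE_DIRS = ["_cache_nj_theory", "_cache_nj_theory_real_data"]


def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def clear_caches(root=".", dry_run=True):
    """Report cache sizes and, unless dry_run is True, delete the caches.

    Returns a dict mapping each cache directory found to its size in bytes.
    """
    found = {}
    for name in CACHE_DIRS:
        path = Path(root) / name
        if not path.is_dir():
            continue  # nothing to report or delete for this cache
        found[name] = dir_size_bytes(path)
        if not dry_run:
            shutil.rmtree(path)
    return found


if __name__ == "__main__":
    # Dry run by default: prints sizes without deleting anything.
    for name, size in clear_caches(dry_run=True).items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

Calling `clear_caches(dry_run=False)` removes the directories; later runs will then recompute everything from scratch.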
To get the trees used in our simulated data benchmark, run:
$ time python -m casbench.papers.paper_nj_theory.dryad
The trees and character matrices will be located in the trees folder.