File | Description |
---|---|
rRNA_pipeline.py | Full rRNA pipeline |
init.txt | optional paths to alternate databases |
db/ | ssu-rRNA databases |
fastq_filter.py | FASTQ filtering |
fasta_dereplicate.py | FASTA dereplication |
swarm_map.py | run swarm |
swarm_classify_taxonomy.py | classify swarm OTUs |
group_taxa.py | aggregate taxonomic group counts |
purity_plot.py | classify OTU content |
plot_OTU_purity.r | produce OTU purity plot |
plot_sample_correlations.r | produce sample tree plot |
plot_taxa_groups.r | produce taxonomic groups plot |
plot_diversity.r | produce sample diversity plot |
plot_heatmap.r | produce sample/OTU heatmap |
rRNA_pipeline v0.4 (May 21, 2016)
Full ssu-rRNA, swarm OTU classification pipeline
Usage: rRNA_pipeline.py (options)
-d name : database name (16S, 18S, V4, V9)
-q dir : FASTQ folder
-o file : base filename for results (default: rrna)
-n file : sample names file (optional)
-p : calculate/plot OTU purity
-m int : minimum quality score for FASTQ (default: 30)
-s, --steps list : run only the steps in list (default: All)
-t, --cpus int : number of processes (default: 1)
-w : no overwrite of files, skip completed steps (default)
-W, --overwrite : overwrite files (default if -s)
-h, --help : help
-v, --verbose : more information to stderr
If -s is used, then -W overwrite is the default
Steps list can include any of the following:
All, merge_fastq, chimera, filter_fasta, derep
swarm, classify, plots, purity, split_plastid
plastid_classify, plastid_plots, plastid_purity
Example: rRNA_pipeline.py -d 16S -q ./fastq
If using alternate databases for 16S, Plastid, 18S, 18S_V4, or 18S_V9, specify paths in init.txt.
Basic operation for 16S:
rRNA_pipeline.py -d 16S -q ./fastq
To replace FASTQ filenames with sample names in all output, use -n to specify a tab-delimited file with two columns (sample_name, FASTQ base name). In the full FASTQ file name, the FASTQ base name may be followed by any of [_R1, _R2, .filtered, .fastq, .fq].
The basic pipeline runs relatively quickly, but the extra calculation of OTU purity takes much longer. Use -p to calculate and plot purity.
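A minimal sketch of how a sample names file of this kind could be read and matched to FASTQ files. The function names `fastq_base` and `read_sample_names` are hypothetical, and stripping the listed suffixes repeatedly from the end of the name is an assumption about the matching rule, not the pipeline's confirmed behavior.

```python
import os

# Suffixes listed in the text above; the stripping order is an assumption.
SUFFIXES = ("_R1", "_R2", ".filtered", ".fastq", ".fq")


def fastq_base(filename):
    """Reduce a FASTQ file name to its base name by removing known suffixes."""
    name = os.path.basename(filename)
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name


def read_sample_names(lines):
    """Map FASTQ base name -> sample name from tab-delimited lines."""
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sample, base = fields[0].strip(), fields[1].strip()
        names[base] = sample
    return names


# Example with made-up names: "siteA_R1.fastq" is reported as "Station1".
names = read_sample_names(["Station1\tsiteA\n", "Station2\tsiteB\n"])
label = names.get(fastq_base("siteA_R1.fastq"))
```

With an actual file, the lines would come from something like `open("sample_names.txt")`.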
Use the following for 18S V4, with sample names, run on 4 CPUs, with purity plot:
rRNA_pipeline.py -d V4 -o rrna -n sample_names.txt -t 4 -p
If the pipeline is stopped and rerun, it skips steps that have already completed, unless -W is specified. To rerun starting at a particular step, delete the output files created during that step and run again. To run only specific steps, use -s and give the steps by name as a comma-separated list. Spaces are not allowed in the list unless it is enclosed in quotes. If -s is used, -W becomes the default and files are overwritten unless -w is given.
Use the following for 18S V9, running only derep, swarm, and classify steps:
rRNA_pipeline.py -d V9 -s "derep,swarm,classify"
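The handling of the -s list and the overwrite defaults can be sketched roughly as follows. The step names are copied from the usage text, but `parse_steps` and `resolve_overwrite` are hypothetical helpers. Letting -W win when both -w and -W are given is an assumption, not something the usage text states.

```python
# Step names copied from the usage text; the helpers are illustrative only.
KNOWN_STEPS = [
    "All", "merge_fastq", "chimera", "filter_fasta", "derep",
    "swarm", "classify", "plots", "purity", "split_plastid",
    "plastid_classify", "plastid_plots", "plastid_purity",
]


def parse_steps(arg):
    """Split a -s value such as "derep, swarm,classify" into step names."""
    steps = [s.strip() for s in arg.split(",") if s.strip()]
    unknown = [s for s in steps if s not in KNOWN_STEPS]
    if unknown:
        raise ValueError("unknown step(s): " + ", ".join(unknown))
    return steps


def resolve_overwrite(steps_given, w_flag=False, W_flag=False):
    """Return True if existing output files should be overwritten.

    Documented defaults: no overwrite normally, overwrite when -s is used.
    An explicit flag overrides the default (assumption: -W beats -w).
    """
    if W_flag:
        return True
    if w_flag:
        return False
    return steps_given


steps = parse_steps("derep,swarm,classify")
overwrite = resolve_overwrite(steps_given=True)  # -s given, so overwrite
```

Quoting matters only to the shell: without quotes, a space would split the list into separate arguments before the script ever sees it.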
Output files, created in the following order:
File | Description |
---|---|
fqbase1.assembled.fastq | Pear merged paired reads |
fqbase1.discarded.fastq | Pear unmerged reads |
fqbase1.unassembled.forward.fastq | Pear unmerged reads R1 |
fqbase1.unassembled.reverse.fastq | Pear unmerged reads R2 |
fqbase1.uchime | Usearch -uchime_ref list of chimeric reads |
fqbase1.filtered.fa | final set of filtered reads |
... | |
rrna.derep.fa | dereplicated reads |
rrna.derep.counts | read counts for dereplicated reads |
rrna.swarm | swarm dereplicated reads in each swarm cluster |
rrna.swarm.fa | representative swarm reads |
rrna.swarm.counts | swarm OTU sample counts table |
rrna.swarm.ggsearch | Fasta36 m8 output |
rrna.swarm.tax | swarm OTU taxonomy and counts table |
rrna.swarm.sample_corr.pdf | plot of sample correlation tree |
rrna.swarm.diversity.pdf | plots of sample richness/diversity metrics |
rrna.swarm.heatmap.pdf | plot of normalized sample OTU abundance |
rrna.taxa_groups.txt | aggregated counts for taxonomic groups |
rrna.taxa_groups.pdf | plot of taxonomic group counts |
rrna.swarm.content.fa | dereplicated reads in largest swarm clusters |
rrna.swarm.content.ggsearch | Fasta36 m8 output |
rrna.swarm.content.tax | swarm content taxonomy |
rrna.swarm.purity | swarm purity table |
rrna.swarm.purity.pdf | plot of swarm purity |
(Here 'fqbase1' stands for the base filename of each paired FASTQ input, of which there may be many, and 'rrna' is the single output base filename.)
In the case of 16S, files are split into two groups: those that are classified as plastid, and those that are not (as determined from 'rrna.swarm.ggsearch'). The non-plastid rows/sequences overwrite the base 'rrna' files, while the plastid 16S rows/sequences are written to a separate set of files with the base name 'rrna.plastid', and their swarm OTUs are re-classified against the PhytoREF database.
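For orientation, the 'rrna.derep.fa' and 'rrna.derep.counts' entries in the output table correspond to dereplication: identical reads are collapsed into one sequence and their occurrences are counted. The sketch below shows the general technique only. It is not the code in fasta_dereplicate.py, and the per-sample counting and abundance ordering are assumptions.

```python
from collections import Counter, OrderedDict


def dereplicate(reads):
    """Collapse identical sequences and count them per sample.

    reads: iterable of (sample, sequence) pairs.
    Returns an OrderedDict mapping each unique sequence to a Counter of
    per-sample counts, most abundant sequence first.
    """
    counts = {}
    for sample, seq in reads:
        counts.setdefault(seq.upper(), Counter())[sample] += 1
    ordered = sorted(
        counts.items(),
        key=lambda item: (-sum(item[1].values()), item[0]),
    )
    return OrderedDict(ordered)


# Made-up reads from two samples.
reads = [("s1", "ACGT"), ("s2", "acgt"), ("s1", "TTTT"), ("s1", "ACGT")]
unique = dereplicate(reads)
```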
- Make sure all dependencies are installed (see below), and make them accessible to your path.
- Databases for 18S V4/V9 derived from PR2 with updated taxonomy are included in db/. Use gunzip to uncompress these files.
- Download the full PR2 database (http://ssu-rrna.org/) and add it to db/ if you want to use other regions of 18S.
- Download the SILVA NR database (http://www.arb-silva.de/) and add it to db/ if you want to use 16S.
- Test the rRNA pipeline:
rRNA_pipeline.py --test
Uncompressing the db/ files and running the test can also be done with make:
make
If you are having trouble with usearch or want to skip the chimera checking for any reason, then edit the init.txt file to turn it off:
chimera: off
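A minimal sketch of reading "key: value" settings of this form. Only the `chimera: off` line comes from this document; the `#` comment handling, the fallback to "on", and the function names are assumptions rather than the pipeline's actual parser.

```python
def parse_init(lines):
    """Parse 'key: value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def chimera_enabled(settings):
    """Chimera checking stays on unless the settings say 'chimera: off'."""
    return settings.get("chimera", "on").lower() != "off"


settings = parse_init(["# hypothetical comment line\n", "chimera: off\n"])
run_chimera = chimera_enabled(settings)
```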
Python 3: If you have Python 3 installed, use the files in source_py3 instead. These can be copied by:
make python3
- Python 2.7 (https://www.python.org/downloads/)
- R (https://cran.r-project.org/)
- PEAR (https://github.com/xflouris/PEAR.git)
- USEARCH v8.0 (http://www.drive5.com/usearch/download.html)
- SWARM (https://github.com/torognes/swarm)
- FASTA36 (https://github.com/wrpearson/fasta36)
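One quick way to confirm that the dependencies are accessible on your path is a short check like the one below. The executable names in `TOOLS` are assumptions (for example, `ggsearch36` for FASTA36) and may differ on your system. `shutil.which` requires Python 3.3 or later.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
TOOLS = ["python", "Rscript", "pear", "usearch", "swarm", "ggsearch36"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(TOOLS):
        print("not found on PATH:", tool)
```

The `which` parameter exists only so the lookup can be swapped out when testing the function.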