rRNA_pipeline

Pipeline for read filtering, swarm OTU clustering, and taxonomic classification of ssu-rRNA

Usage

File	Description
rRNA_pipeline.py	Full rRNA pipeline
init.txt	optional paths to alternate databases
db/	ssu-rRNA databases
fastq_filter.py	FASTQ filtering
fasta_dereplicate.py	FASTA dereplication
swarm_map.py	run swarm
swarm_classify_taxonomy.py	classify swarm OTUs
group_taxa.py	aggregate taxonomic group counts
purity_plot.py	classify OTU content
plot_OTU_purity.r	produce OTU purity plot
plot_sample_correlations.r	produce sample tree plot
plot_taxa_groups.r	produce taxonomic groups plot
plot_diversity.r	produce sample diversity plot
plot_heatmap.r	produce sample/OTU heatmap

rRNA_pipeline v0.4 (May 21, 2016)
Full ssu-rRNA, swarm OTU classification pipeline

Usage: rRNA_pipeline.py (options)
   -d name          : database name (16S, 18S, V4, V9)
   -q dir           : FASTQ folder
   -o file          : base filename for results (default: rrna)
   -n file          : sample names file (optional)
   -p               : calculate/plot OTU purity
   -m int           : minimum quality score for FASTQ (default: 30)
   -s, --steps list : run only the steps in list (default: All)
   -t, --cpus int   : number of processes (default: 1)
   -w               : no overwrite of files, skip completed steps (default)
   -W, --overwrite  : overwrite files (default if -s)
   -h, --help       : help
   -v, --verbose    : more information to stderr

   If -s is used, then -W overwrite is the default
   Steps list can include any of the following: 
      All, merge_fastq, chimera, filter_fasta, derep
      swarm, classify, plots, purity, split_plastid
      plastid_classify, plastid_plots, plastid_purity

Example: rRNA_pipeline.py -d 16S -q ./fastq

If using alternate databases for 16S, Plastid, 18S, 18S_V4, or 18S_V9, specify paths in init.txt.

Basic operation for 16S:

rRNA_pipeline.py -d 16S -q ./fastq

To replace FASTQ filenames with sample names in all output, use -n to specify a tab-delimited file with two columns (sample_name, FASTQ base name). In the full FASTQ file name, the FASTQ base name may be followed by any of [_R1, _R2, .filtered, .fastq, .fq].
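The matching between FASTQ file names and sample names can be sketched as follows. This is only an illustration of the rule described above, not the pipeline's own code: the function names `fastq_base_name` and `read_sample_names`, the example file names, and the choice to strip suffixes repeatedly are all assumptions. It is written for Python 3.

```python
import csv
import io
import os

# Suffixes that may follow a FASTQ base name, as listed above.
SUFFIXES = ("_R1", "_R2", ".filtered", ".fastq", ".fq")


def fastq_base_name(filename):
    """Strip any trailing run of the known suffixes from a FASTQ file name.

    Hypothetical helper: repeated stripping is an assumption about how
    stacked suffixes such as '_R1.filtered.fastq' are handled.
    """
    name = os.path.basename(filename)
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name


def read_sample_names(handle):
    """Read tab-delimited (sample_name, FASTQ base name) rows.

    Returns a dict keyed by FASTQ base name.
    """
    names = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            names[row[1].strip()] = row[0].strip()
    return names


# Hypothetical sample names file content (two samples).
example = io.StringIO("station1\tS1_L001\nstation2\tS2_L001\n")
names = read_sample_names(example)

base = fastq_base_name("fastq/S1_L001_R1.filtered.fastq")
# Fall back to the base name when no sample name is listed.
print(base, "->", names.get(base, base))
```

A FASTQ file with no entry in the sample names file keeps its base name in this sketch; whether the pipeline does the same is not stated here.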

The basic pipeline runs relatively quickly, but the extra calculation of OTU purity takes much longer. Use -p to calculate and plot purity.

Use the following for 18S V4, with sample names, run on 4 CPUs, with purity plot:

rRNA_pipeline.py -d V4 -o rrna -n sample_names.txt -t 4 -p

If the rRNA pipeline is stopped and rerun, it skips steps that have already completed, unless -W is specified. To rerun starting at a particular step, delete the output files created during that step and run again. To run only specific steps, use -s and list the steps by name, separated by commas. Spaces are not allowed in the list unless it is enclosed in quotes. If -s is used, -W becomes the default and files are overwritten unless -w is given.

Use the following for 18S V9, running only derep, swarm, and classify steps:

rRNA_pipeline.py -d V9 -s "derep,swarm,classify"
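The interaction between -s, -w, and -W can be sketched in a few lines. This is a minimal illustration of the documented defaults, not the pipeline's actual option handling: `parse_steps` and `overwrite_enabled` are hypothetical names, exact (case-sensitive) matching of step names is an assumption, and so is letting -W win if both -w and -W are given.

```python
# Step names copied from the usage text.
VALID_STEPS = [
    "All", "merge_fastq", "chimera", "filter_fasta", "derep",
    "swarm", "classify", "plots", "purity", "split_plastid",
    "plastid_classify", "plastid_plots", "plastid_purity",
]


def parse_steps(steps_arg):
    """Split a comma-separated -s value and reject unknown step names.

    Assumption: names must match the usage text exactly.
    """
    steps = [s.strip() for s in steps_arg.split(",") if s.strip()]
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        raise ValueError("unknown step(s): " + ", ".join(unknown))
    return steps


def overwrite_enabled(steps_given, no_overwrite_flag, overwrite_flag):
    """Resolve the documented defaults.

    Without -s the default is no overwrite; with -s the default is
    overwrite. An explicit flag takes precedence over the default.
    Assumption: -W wins if both -w and -W are given.
    """
    if overwrite_flag:
        return True
    if no_overwrite_flag:
        return False
    return steps_given


steps = parse_steps("derep,swarm,classify")
print(steps, overwrite_enabled(True, False, False))
```

Quoting the list on the command line, as in the example above, lets it reach the script as a single argument even if it contains spaces.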

Output files, created in the following order:

File	Description
fqbase1.assembled.fastq	Pear merged paired reads
fqbase1.discarded.fastq	Pear unmerged reads
fqbase1.unassembled.forward.fastq	Pear unmerged reads R1
fqbase1.unassembled.reverse.fastq	Pear unmerged reads R2
fqbase1.uchime	Usearch -uchime_ref list of chimeric reads
fqbase1.filtered.fa	final set of filtered reads
...

rrna.derep.fa	dereplicated reads
rrna.derep.counts	read counts for dereplicated reads
rrna.swarm	swarm dereplicated reads in each swarm cluster
rrna.swarm.fa	representative swarm reads
rrna.swarm.counts	swarm OTU sample counts table
rrna.swarm.ggsearch	Fasta36 m8 output
rrna.swarm.tax	swarm OTU taxonomy and counts table
rrna.swarm.sample_corr.pdf	plot of sample correlation tree
rrna.swarm.diversity.pdf	plots of sample richness/diversity metrics
rrna.swarm.heatmap.pdf	plot of normalized sample OTU abundance
rrna.taxa_groups.txt	aggregated counts for taxonomic groups
rrna.taxa_groups.pdf	plot of taxonomic group counts
rrna.swarm.content.fa	dereplicated reads in largest swarm clusters
rrna.swarm.content.ggsearch	Fasta36 m8 output
rrna.swarm.content.tax	swarm content taxonomy
rrna.swarm.purity	swarm purity table
rrna.swarm.purity.pdf	plot of swarm purity

(Here 'fqbase1' stands for the base filename of one paired FASTQ input, of which there may be many, and 'rrna' is the single output base filename.)

In the case of 16S, files are split into two groups: those classified as plastid and those that are not, as determined by 'rrna.swarm.ggsearch'. The non-plastid rows/sequences overwrite the base 'rrna' files. The plastid 16S rows/sequences are written to a separate set of files, 'rrna.plastid', and their swarm OTUs are re-classified against the phytoRef database.

Installation

Make sure all dependencies are installed (see below) and accessible on your path.
Databases for 18S V4/V9 derived from PR2 with updated taxonomy are included in db/. Use gunzip to uncompress these files.
Download the full PR2 database (http://ssu-rrna.org/) and add it to db/ if you want to use other regions of 18S.
Download the SILVA NR database (http://www.arb-silva.de/) and add it to db/ if you want to use 16S.
Test the rRNA pipeline:

rRNA_pipeline.py --test

Uncompressing the db/ files and running the test can also be accomplished with make:

make

If you are having trouble with usearch or want to skip the chimera checking for any reason, then edit the init.txt file to turn it off:

chimera: off
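As a rough illustration of how a `key: value` setting like this could be read, here is a small sketch. It assumes every setting in init.txt follows the same `key: value` form as the line above and that chimera checking is on when the key is absent; `parse_init` and `chimera_enabled` are hypothetical names, and the pipeline's own parsing may differ.

```python
def parse_init(lines):
    """Parse 'key: value' lines into a dict.

    Assumption: blank lines and lines without a colon are ignored.
    Splitting on the first colon only keeps values such as paths intact.
    """
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def chimera_enabled(settings):
    """Chimera checking stays on unless init.txt says 'chimera: off'.

    Assumption: 'on' is the default when the key is missing.
    """
    return settings.get("chimera", "on").lower() != "off"


settings = parse_init(["chimera: off", ""])
print(chimera_enabled(settings))
```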

Python 3: If you have Python 3 installed, use the files in source_py3 instead. These can be copied with:

make python3

Dependencies

Python 2.7 (https://www.python.org/downloads/)
R (https://cran.r-project.org/)
PEAR (https://github.com/xflouris/PEAR.git)
USEARCH v8.0 (http://www.drive5.com/usearch/download.html)
SWARM (https://github.com/torognes/swarm)
FASTA36 (https://github.com/wrpearson/fasta36)
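Before running the pipeline, it can help to confirm that the external tools are actually on your path. The sketch below uses Python 3's `shutil.which` for this. The executable names in `TOOLS` are assumptions (for example, `ggsearch36` for FASTA36 and `usearch` for USEARCH, whose downloaded binary is often named differently), so adjust them to match your installation.

```python
import shutil

# Assumed executable names; change these to match your installation.
TOOLS = ["Rscript", "pear", "usearch", "swarm", "ggsearch36"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```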
