GitHub - allenlab/rRNA_pipeline: ssu-rRNA swarm OTU classification pipeline


rRNA_pipeline

Pipeline for read filtering, swarm OTU clustering, and taxonomic classification of ssu-rRNA

[Figure: rrna_pipeline_procedure]

Usage

File                          Description
rRNA_pipeline.py              Full rRNA pipeline
init.txt                      optional paths to alternate databases
db/                           ssu-rRNA databases
fastq_filter.py               FASTQ filtering
fasta_dereplicate.py          FASTA dereplication
swarm_map.py                  run swarm
swarm_classify_taxonomy.py    classify swarm OTUs
group_taxa.py                 aggregate taxonomic group counts
purity_plot.py                classify OTU content
plot_OTU_purity.r             produce OTU purity plot
plot_sample_correlations.r    produce sample tree plot
plot_taxa_groups.r            produce taxonomic groups plot
plot_diversity.r              produce sample diversity plot
plot_heatmap.r                produce sample/OTU heatmap
rRNA_pipeline v0.4 (May 21, 2016)
Full ssu-rRNA, swarm OTU classification pipeline

Usage: rRNA_pipeline.py (options)
   -d name          : database name (16S, 18S, V4, V9)
   -q dir           : FASTQ folder
   -o file          : base filename for results (default: rrna)
   -n file          : sample names file (optional)
   -p               : calculate/plot OTU purity
   -m int           : minimum quality score for FASTQ (default: 30)
   -s, --steps list : run only the steps in list (default: All)
   -t, --cpus int   : number of processes (default: 1)
   -w               : no overwrite of files, skip completed steps (default)
   -W, --overwrite  : overwrite files (default if -s)
   -h, --help       : help
   -v, --verbose    : more information to stderr

   If -s is used, then -W overwrite is the default
   Steps list can include any of the following: 
      All, merge_fastq, chimera, filter_fasta, derep
      swarm, classify, plots, purity, split_plastid
      plastid_classify, plastid_plots, plastid_purity

Example: rRNA_pipeline.py -d 16S -q ./fastq

If using alternate databases for 16S, Plastid, 18S, 18S_V4, or 18S_V9, specify paths in init.txt.

Basic operation for 16S:

rRNA_pipeline.py -d 16S -q ./fastq

To replace FASTQ filenames with sample names in all output, use -n to specify a tab-delimited file with two columns (sample_name, FASTQ base name). In the full FASTQ file name, the FASTQ base name may be followed by any of [_R1, _R2, .filtered, .fastq, .fq].
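The name mapping described above can be sketched as follows. This is only an illustration of the documented file layout, not the pipeline's own code: the function names (fastq_base, read_sample_names, sample_for) are hypothetical, and stripping the suffixes repeatedly from the end of the file name is an assumption about how the base name is recovered.

```python
import csv
import io
import os

# Suffixes that the README says may follow the FASTQ base name.
SUFFIXES = ("_R1", "_R2", ".filtered", ".fastq", ".fq")


def fastq_base(filename):
    """Drop the directory and any trailing known suffixes (assumed behaviour)."""
    name = os.path.basename(filename)
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name


def read_sample_names(handle):
    """Read a tab-delimited (sample_name, FASTQ base name) file.

    Returns a dict mapping FASTQ base name -> sample name.
    """
    names = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0].strip():
            names[row[1].strip()] = row[0].strip()
    return names


def sample_for(filename, names):
    """Return the sample name for a FASTQ file, or its base name if unmapped."""
    base = fastq_base(filename)
    return names.get(base, base)


if __name__ == "__main__":
    # Hypothetical sample names file with two samples.
    example = io.StringIO("Station1\tsiteA\nStation2\tsiteB\n")
    mapping = read_sample_names(example)
    print(sample_for("./fastq/siteA_R1.filtered.fastq", mapping))
```

With a real file, the handle would come from open("sample_names.txt", newline="") instead of the in-memory example.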

The basic pipeline runs relatively quickly, but the extra calculation of OTU purity takes much longer. Use -p to calculate and plot purity.

Use the following for 18S V4, with sample names, run on 4 CPUs, with purity plot:

rRNA_pipeline.py -d V4 -o rrna -n sample_names.txt -t 4 -p

If the rRNA pipeline is stopped and rerun, it skips the steps that have already completed, unless -W is specified. To rerun starting at a particular step, delete the output files created during that step and run again. To run only specific steps, use -s and specify the steps by name in a comma-separated list. Spaces are not allowed unless the list is enclosed in quotes. If -s is used, then -W is the default and files will be overwritten unless -w is given.
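The interaction between -s, -w and -W can be sketched as below. This is a minimal illustration of the documented rules, not the pipeline's implementation: the function names are hypothetical, the execution order is assumed to follow the order of the step list in the usage text, and letting -W win when both flags are given is also an assumption.

```python
import os

# Step names from the usage text; treating this as execution order is an assumption.
STEPS = [
    "merge_fastq", "chimera", "filter_fasta", "derep",
    "swarm", "classify", "plots", "purity", "split_plastid",
    "plastid_classify", "plastid_plots", "plastid_purity",
]


def parse_steps(arg=None):
    """Turn a -s value such as "derep,swarm,classify" into an ordered step list."""
    if arg is None:
        return list(STEPS)
    requested = [s.strip() for s in arg.split(",") if s.strip()]
    if "All" in requested:
        return list(STEPS)
    unknown = [s for s in requested if s not in STEPS]
    if unknown:
        raise ValueError("unknown step(s): " + ", ".join(unknown))
    # Keep the assumed pipeline order regardless of how the user ordered the list.
    return [s for s in STEPS if s in requested]


def resolve_overwrite(steps_given, w_flag=False, W_flag=False):
    """Documented defaults: no overwrite normally, overwrite when -s is used."""
    if W_flag:  # assumption: -W wins if both flags are present
        return True
    if w_flag:
        return False
    return steps_given


def should_run(outputs, overwrite, exists=os.path.exists):
    """Run a step if overwriting, or if any of its output files is missing."""
    return overwrite or not all(exists(path) for path in outputs)


if __name__ == "__main__":
    steps = parse_steps("derep,swarm,classify")
    overwrite = resolve_overwrite(steps_given=True)
    for step in steps:
        print(step, "overwrite" if overwrite else "skip if complete")
```

The should_run check also reflects the advice above: deleting a step's output files makes that step run again on the next invocation.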

Use the following for 18S V9, running only derep, swarm, and classify steps:

rRNA_pipeline.py -d V9 -s "derep,swarm,classify"

Output files, created in the following order:

File                                 Description
fqbase1.assembled.fastq              Pear merged paired reads
fqbase1.discarded.fastq              Pear unmerged reads
fqbase1.unassembled.forward.fastq    Pear unmerged reads R1
fqbase1.unassembled.reverse.fastq    Pear unmerged reads R2
fqbase1.uchime                       Usearch -uchime_ref list of chimeric reads
fqbase1.filtered.fa                  final set of filtered reads
...
rrna.derep.fa                        dereplicated reads
rrna.derep.counts                    read counts for dereplicated reads
rrna.swarm                           swarm dereplicated reads in each swarm cluster
rrna.swarm.fa                        representative swarm reads
rrna.swarm.counts                    swarm OTU sample counts table
rrna.swarm.ggsearch                  Fasta36 m8 output
rrna.swarm.tax                       swarm OTU taxonomy and counts table
rrna.swarm.sample_corr.pdf           plot of sample correlation tree
rrna.swarm.diversity.pdf             plots of sample richness/diversity metrics
rrna.swarm.heatmap.pdf               plot of normalized sample OTU abundance
rrna.taxa_groups.txt                 aggregated counts for taxonomic groups
rrna.taxa_groups.pdf                 plot of taxonomic group counts
rrna.swarm.content.fa                dereplicated reads in largest swarm clusters
rrna.swarm.content.ggsearch          Fasta36 m8 output
rrna.swarm.content.tax               swarm content taxonomy
rrna.swarm.purity                    swarm purity table
rrna.swarm.purity.pdf                plot of swarm purity

(Here 'fqbase1' stands for the base filename of each paired FASTQ input, of which there may be many, and 'rrna' is the single output base filename.)
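To make the relationship between the per-sample filtered reads and the single dereplicated set concrete, here is a minimal sketch of dereplication with per-sample counts. It is not the code in fasta_dereplicate.py: the function name, the in-memory input format, the case-insensitive comparison and the abundance ordering are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Collapse identical sequences and count them per sample.

    reads: iterable of (sample_name, sequence) pairs.
    Returns a list of (sequence, Counter of sample -> count),
    most abundant sequence first (ordering is an assumption).
    """
    counts = defaultdict(Counter)
    for sample, seq in reads:
        counts[seq.upper()][sample] += 1
    return sorted(
        counts.items(),
        key=lambda item: (-sum(item[1].values()), item[0]),
    )


if __name__ == "__main__":
    # Hypothetical reads from two samples.
    reads = [("s1", "ACGT"), ("s1", "acgt"), ("s2", "ACGT"), ("s2", "TTGA")]
    for seq, per_sample in dereplicate(reads):
        print(seq, dict(per_sample))
```

Each unique sequence corresponds to one record in a dereplicated FASTA file, and its per-sample counter corresponds to one row of a counts table.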

In the case of 16S, the results are split into two groups, as determined by 'rrna.swarm.ggsearch': those classified as plastid and those that are not. The non-plastid rows/sequences are written over the base 'rrna' files. The plastid 16S rows/sequences are written to a separate set of files, 'rrna.plastid', and those swarm OTUs are re-classified using the PhytoRef database.
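The split can be pictured with the sketch below, which partitions OTUs by their best hit in m8 (BLAST-style tabular) output. It is a hedged illustration only: the function names are hypothetical, choosing the best hit by bit score (column 12) is an assumption, and the README does not say how a hit is recognised as plastid, so the test is left to a caller-supplied is_plastid function.

```python
def best_hits(m8_lines):
    """Return {query: (subject, bit_score)} keeping the highest-scoring hit.

    Assumes the standard 12-column tabular layout, with the query in
    column 1, the subject in column 2 and the bit score in column 12.
    """
    best = {}
    for line in m8_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def split_plastid(m8_lines, is_plastid):
    """Partition query IDs into (plastid, non_plastid) by their best hit."""
    plastid, other = [], []
    for query, (subject, _bits) in best_hits(m8_lines).items():
        (plastid if is_plastid(subject) else other).append(query)
    return plastid, other


if __name__ == "__main__":
    # Hypothetical hits; the "plastid|" prefix is made up for this example.
    lines = [
        "otu1\tplastid|refA\t98.0\t250\t5\t0\t1\t250\t1\t250\t1e-100\t400\n",
        "otu1\tbact|refB\t90.0\t250\t25\t0\t1\t250\t1\t250\t1e-80\t300\n",
        "otu2\tbact|refC\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-110\t420\n",
    ]
    print(split_plastid(lines, lambda s: s.startswith("plastid|")))
```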

Installation

  1. Make sure all dependencies are installed (see below), and make them accessible to your path.
  2. Databases for 18S V4/V9 derived from PR2 with updated taxonomy are included in db/. Use gunzip to uncompress these files.
  3. Download the full PR2 database (http://ssu-rrna.org/) and add it to db/ if you want to use other regions of 18S.
  4. Download the SILVA NR database (http://www.arb-silva.de/) and add it to db/ if you want to use 16S.
  5. Test the rRNA pipeline:
rRNA_pipeline.py --test

Uncompressing the db/ files and running the test can also be accomplished with make:

make

If you are having trouble with usearch or want to skip the chimera checking for any reason, then edit the init.txt file to turn it off:

chimera: off
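As an illustration of how a "key: value" setting such as the one above could be read, here is a minimal sketch. It is not the pipeline's parser: the function names are hypothetical, the exact syntax of init.txt beyond the "chimera: off" line is an assumption, and the database key and path in the example are made up.

```python
def read_init(lines):
    """Parse "key: value" lines into a dict, skipping blank or malformed lines."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def chimera_enabled(settings):
    """Chimera checking is assumed to be on unless explicitly set to "off"."""
    return settings.get("chimera", "on").lower() != "off"


if __name__ == "__main__":
    # Hypothetical init.txt content; the 16S path is illustrative only.
    example = ["chimera: off\n", "\n", "16S: /data/silva.fa\n"]
    settings = read_init(example)
    print(settings, chimera_enabled(settings))
```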

Python 3

If you have Python 3 installed, use the files in source_py3 instead. These can be copied by:

make python3

Dependencies
