This package contains the nf-core/rnaseq pipeline (version 3.14.0) ported to the Viash framework. We stay as close to the nf-core pipeline as possible, which means we provide a subworkflow for each of the main stages of the pipeline as depicted in the nf-core README.
The workflow is built in a modular fashion, where most of the base functionality is provided by components from biobox, supplemented by custom base components and workflow components in this package. This architecture ensures both flexibility and reproducibility while leveraging established bioinformatics tools.
Each of the workflow components is implemented as a stand-alone module with a standardized interface. This design philosophy offers several advantages:
- Isolation of Tools and Functionality: Each subworkflow, as well as its individual components, can be executed independently as a stand-alone entity when you need only specific functionality (see the sketch after this list).
- Consistent Interfaces: All components follow standardized input/output conventions, making it easy to connect or replace them.
- Reusability: Each component or workflow can be integrated seamlessly as a dependency in another workflow.
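As an illustration, an individual component can be executed directly with the Viash CLI. The component path and arguments below are hypothetical, chosen only to sketch the idea:

```bash
# Hypothetical sketch: run one component stand-alone with the Viash CLI.
# The config path and arguments are illustrative, not taken from this package.
viash run src/fastqc/config.vsh.yaml -- \
  --input testData/minimal_test/SRR6357070_1.fastq.gz \
  --output fastqc_results/
```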
The end-to-end rnaseq workflow consists of six subworkflows, each of which can also be run independently (a sketch of a stand-alone run follows the list).
- Prepare genome: Preparation of all the reference data required for downstream analysis, i.e., uncompressing the provided reference data or generating the required index files (for STAR, Salmon, Kallisto, RSEM, BBSplit).
- Pre-processing: Quality control of the input reads: running FastQC, extracting UMIs, trimming adapters, and removing ribosomal RNA reads. Adapters can be trimmed using either Trim Galore! or fastp (work in progress).
- Genome alignment and quantification: Genome alignment using STAR and transcript quantification using Salmon or RSEM (using RSEM's built-in support for STAR) (work in progress). Alignment sorting and indexing, as well as computation of statistics from the BAM files, is performed using Samtools. UMI-based deduplication is also performed.
- Post-processing: Marking of duplicate reads (picard MarkDuplicates), transcript assembly and quantification (StringTie), and creation of bigWig coverage files.
- Pseudo alignment and quantification: Pseudo alignment and transcript quantification using Salmon or Kallisto.
- Final QC: A quality control workflow running RSeQC, dupRadar, Qualimap, Preseq, DESeq2, and featureCounts, with results aggregated into a MultiQC report covering raw reads, alignments, gene biotypes, sample similarity, and strand specificity.
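For example, a single subworkflow such as the genome preparation stage could in principle be launched on its own. The `-main-script` path and parameter names below are assumptions modeled on the full-workflow command shown later in this README:

```bash
# Sketch only: stand-alone run of the (assumed) prepare_genome subworkflow.
# The module path and parameter names are illustrative assumptions.
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/prepare_genome/main.nf \
  --fasta testData/minimal_test/reference/genome.fasta \
  --gtf testData/minimal_test/reference/genes.gtf.gz \
  --publish_dir prepare_genome_test/
```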
The rnaseq package is available via Viash Hub, where you can find instructions on how to run the end-to-end workflow as well as the individual subworkflows or components.
As test data, we can use the small dataset nf-core provided with their `test` profile: https://github.com/nf-core/test-datasets/tree/rnaseq3/testdata/GSE110004. A simple script has been provided to fetch those files from the GitHub repository and store them under `testData/minimal_test` (the subdirectory is created to support `full_test` later as well): `bin/get_minimal_test_data.sh`.
Additionally, a script has been provided to fetch extra resources for unit testing the components. These will be stored under `testData/unit_test_resources`: `bin/get_unit_test_data.sh`.
The test data can be downloaded by running the following commands:

```bash
bin/get_minimal_test_data.sh
bin/get_unit_test_data.sh
```
To run the end-to-end workflow, browse to the rnaseq workflow on Viash Hub. There you can find an overview of the formats of the input and output files, as well as a detailed list of required and optional parameters for the workflow. The workflow can be run via the CLI with Nextflow or on Seqera Cloud.
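The parameter overview can also be printed from the CLI; Viash-generated Nextflow workflows accept a `--help` flag (treat the exact invocation below as an assumption, reusing the revision and script path from the run command further down):

```bash
# Assumed: the Viash-generated workflow prints its parameter list on --help.
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/rnaseq/main.nf \
  --help
```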
With `nextflow` installed, we can follow the on-screen instructions after clicking `launch`.
- The first step is to select the execution environment, which is Nextflow in this example.
- We can now fill in the parameters for the workflow. In this example, use the locations of the test data that were downloaded earlier. We select the `advanced form` option to be able to process multiple samples in parallel. We fill out the global parameters first; those are the parameters that apply to all samples. Then we fill in our parameter sets, one parameter set per sample. Note that each sample can consist of multiple fastq files.
- Once we hit launch, we can execute the workflow by following the instructions on the screen:
```bash
cat > params.yaml <<'EOM'
param_list:
  - id: "WT_REP1"
    fastq_1: [ "SRR6357070_1.fastq.gz", "SRR6357071_1.fastq.gz" ]
    fastq_2: [ "SRR6357070_2.fastq.gz", "SRR6357071_2.fastq.gz" ]
    strandedness: "reverse"
  - id: "WT_REP2"
    fastq_1: [ "SRR6357072_1.fastq.gz" ]
    fastq_2: [ "SRR6357072_2.fastq.gz" ]
    strandedness: "reverse"
fasta: "testData/minimal_test/reference/genome.fasta"
publish_dir: "full_pipeline_test/"
gtf: "testData/minimal_test/reference/genes.gtf.gz"
transcript_fasta: "testData/minimal_test/reference/transcriptome.fasta"
EOM
```
```bash
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/rnaseq/main.nf \
  -params-file params.yaml \
  -latest \
  -resume
```
It’s also possible to run the workflow directly on Seqera Cloud. The required Nextflow schema files are provided with the workflows. Since Seqera Cloud does not support multiple-value parameters when using the form-based input, we will use Viash Hub to launch the workflow.
- First, we need to create an API token for our Seqera Cloud account (the same token can also be used for CLI launches; see the sketch after these steps).
- Next, we can launch the workflow by selecting `Seqera Cloud` as the execution environment. Here you can add your API key, as well as the Workspace ID and Compute Environment.
- We can now fill in the parameters, as described under `Run using Nextflow`. Note that a direct link to the test data needs to be provided for Seqera Cloud execution, e.g. to test data in a GitHub repository or data on a cloud storage service.
- By launching the workflow via Viash Hub, it will be executed on Seqera Cloud in your workspace environment of choice.
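As an aside, the same API token also lets you monitor CLI launches in Seqera Platform via Nextflow's built-in Tower integration. The command below is a sketch that reuses the run shown earlier; `TOWER_ACCESS_TOKEN` and `-with-tower` are standard Nextflow/Seqera features, not part of this package:

```bash
# Sketch: reuse the Seqera API token for a CLI run with monitoring enabled.
export TOWER_ACCESS_TOKEN=<your-api-token>
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/rnaseq/main.nf \
  -params-file params.yaml \
  -with-tower
```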
Nextflow's labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU and memory:

- CPU: `singlecpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
- memory: `lowmem`, `midmem`, `highmem`, `veryhighmem`

The defaults for these labels can be found in `src/workflows/utils/labels.config`. Nextflow checks that the resources specified for a process do not exceed what is available on the machine, and will refuse to start if they do. Create your own config file to tune the labels to your needs, for example:
```groovy
process {
  // Resource labels
  withLabel: singlecpu   { cpus = 1 }
  withLabel: lowcpu      { cpus = 2 }
  withLabel: midcpu      { cpus = 4 }
  withLabel: highcpu     { cpus = 8 }
  withLabel: veryhighcpu { cpus = 16 }
  // get_memory is a helper (presumably defined in
  // src/workflows/utils/labels.config); the request scales with
  // task.attempt so retries get more memory.
  withLabel: lowmem      { memory = { get_memory( 4.GB * task.attempt ) } }
  withLabel: midmem      { memory = { get_memory( 16.GB * task.attempt ) } }
  withLabel: highmem     { memory = { get_memory( 24.GB * task.attempt ) } }
  withLabel: veryhighmem { memory = { get_memory( 48.GB * task.attempt ) } }
}
```
When starting Nextflow from the CLI, you can use the `-c` flag to provide this file to Nextflow and override the defaults.
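For instance, assuming the config above was saved as `custom_labels.config` (a filename chosen here for illustration), the earlier run command becomes:

```bash
# Override the default resource labels with a custom config file.
# custom_labels.config is a hypothetical name for the example config above.
nextflow run https://packages.viash-hub.com/vsh/rnaseq.git \
  -revision v0.2.0 \
  -main-script target/nextflow/workflows/rnaseq/main.nf \
  -params-file params.yaml \
  -c custom_labels.config
```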
This workflow was developed by Data Intuitive. Other contributions are welcome.