This is a bioinformatics analysis pipeline used for small RNA sequencing data with a focus on miRNA.
This pipeline was adapted from nf-core/smrnaseq version 1.0.0. However, numerous changes have been made to the pipeline. They include, but are not limited to:
- Added differential miRNA expression analysis
- The library composition analysis now includes other small RNA types in addition to rRNA, tRNA, and miRNA
- Added differential expression analysis for genes of other small RNA types in addition to miRNAs
- Numerous report improvements including custom MultiQC modules
- Upgraded to Nextflow DSL2
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
1. Raw read QC (FastQC)
2. Adapter trimming (Trim Galore!)
3. Collapse reads (seqcluster)
4. Alignment against miRBase hairpin with reads from step 3 (Bowtie1)
5. Alignment with reads from step 2 against mature tRNA (GtRNAdb), mitochondrial tRNA (Ensembl), rRNA (Ensembl and UCSC Repeatmasker), lncRNA, scaRNA, snoRNA, snRNA, miscRNA (Ensembl), and miRNA hairpin (miRBase) (Bowtie1)
6. Read count quantification using RSEM with files from step 5 (RSEM)
7. RSEM results summary with tximport (tximport) and RNA type quantification
8. Differential expression analysis of non-miRNA RNA types calculated with DESeq2 from step 7 tximport summary (DESeq2)
9. miRNA and isomiR annotation from step 4 (mirtop)
10. Sample comparison and statistical analysis (isomiRs)
11. miRNA quality control using reads from step 2 (mirtrace)
12. Present stats, plots, results (MultiQC)
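Several steps above name the earlier step whose output they consume. As a rough illustration only, the sketch below records just those stated links in a Python dictionary and walks them to list everything a given step depends on. The names `STATED_INPUTS` and `upstream` are hypothetical and are not part of the pipeline; links the list does not spell out (for example, what feeds step 10 or step 12) are deliberately left out rather than guessed.

```python
# Step -> steps whose output it is described as using in the list above.
# Only links stated in the text are recorded; this is not the pipeline's
# actual Nextflow channel wiring.
STATED_INPUTS = {
    4: [3],   # hairpin alignment uses collapsed reads from step 3
    5: [2],   # multi-RNA alignment uses trimmed reads from step 2
    6: [5],   # RSEM quantification uses files from step 5
    7: [6],   # tximport summarises the RSEM results
    8: [7],   # DESeq2 uses the step 7 tximport summary
    9: [4],   # mirtop annotation uses output from step 4
    11: [2],  # mirtrace uses trimmed reads from step 2
}


def upstream(step):
    """Return every step that `step` depends on, directly or indirectly."""
    seen = set()
    stack = list(STATED_INPUTS.get(step, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(STATED_INPUTS.get(current, []))
    return seen
```

For example, `upstream(8)` collects steps 2, 5, 6, and 7, which matches the chain from trimming through alignment, RSEM, and tximport described for the DESeq2 step.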
It is preferable to run the pipeline on the Aladdin Bioinformatics Platform. Running it on your own computer or cloud is also possible, but you will need to modify the genomes config file [conf/igenomes.config] to provide your own genome files.
nextflow run Zymo-Research/aladdin-smrnaseq \
--design "<design CSV file>" \
--genome 'Homo_sapiens(GRCh38)' \
--protocol zymo \
-profile awsbatch \
-work-dir "<work dir on S3>" \
--awsqueue "<SQS ARN>" \
--outdir "<output dir on S3>" \
--name "my_analysis" \
-r "0.0.1"
- The options --design and --genome are required. Remember to enclose the --genome parameter value in single quotes because it contains parentheses, which are special characters in the shell.
- The options -profile, -work-dir, --outdir, and --awsqueue are required only when running the pipeline on AWS Batch, but are highly recommended.
- The option -r pins the workflow to a specific release on GitHub.
- The option --name adds a custom title to the report.
- The design CSV file must have the following format.
group,sample,read_1,read_2
Control,Sample1,s3://mybucket/this_is_s1_R1.fastq.gz,
Control,Sample2,s3://mybucket/this_is_s2_R1.fastq.gz,
Experiment,Sample3,s3://mybucket/that_is_s3_R1.fastq.gz,
Experiment,Sample4,s3://mybucket/that_be_s4_R1.fastq.gz,
The header line must be present and cannot be changed. Sample labels and group names must contain only alphanumerical characters or underscores, must not start with "R1" or "R2", and must start with a letter. Sample labels and group names also cannot include phrases that will be automatically filtered by MultiQC. A list of terms unsuitable for sample and group labels in this pipeline can be viewed in the MultiQC source code here. Only single-end sequencing is supported. Full S3 paths of R1 FASTQ files should be provided. R2 FASTQ files are ignored if provided. If you do not wish to carry out statistical comparisons of samples, simply leave the group column blank (but keep the comma).
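The rules above are easy to check before launching a run. The following is a minimal sketch of such a check, not the pipeline's own validation code: the function names `label_problem` and `validate_design` are hypothetical, "alphanumerical" is assumed to mean ASCII letters and digits, and the MultiQC filtered-term check is omitted because that list is not reproduced here.

```python
import csv
import io
import re

EXPECTED_HEADER = ["group", "sample", "read_1", "read_2"]
# Letters, digits, and underscores only; the first character must be a letter.
# Assumption: "alphanumerical" means ASCII letters and digits.
LABEL_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def label_problem(label):
    """Return a description of what is wrong with a label, or None if it is fine."""
    if not LABEL_RE.fullmatch(label):
        return "must start with a letter and contain only letters, digits, or underscores"
    if label.startswith(("R1", "R2")):
        return 'must not start with "R1" or "R2"'
    return None


def validate_design(text):
    """Check design CSV text against the stated rules; return a list of error strings."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be exactly: " + ",".join(EXPECTED_HEADER)]
    errors = []
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:  # csv.reader yields an empty list for a blank line
            continue
        if len(row) != 4:
            errors.append(f"line {line_no}: expected 4 columns, found {len(row)}")
            continue
        group, sample, read_1, _read_2 = (field.strip() for field in row)
        # The group column may be left blank to skip statistical comparisons.
        if group:
            problem = label_problem(group)
            if problem:
                errors.append(f"line {line_no}: group '{group}' {problem}")
        problem = label_problem(sample)
        if problem:
            errors.append(f"line {line_no}: sample '{sample}' {problem}")
        # read_2 is ignored by the pipeline, so only read_1 is required.
        if not read_1:
            errors.append(f"line {line_no}: read_1 is empty")
    return errors
```

Passing the contents of a design file to `validate_design` returns an empty list when no rule is violated, and one message per problem otherwise. The sketch does not check that the FASTQ paths exist or that they are S3 paths.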
To run the pipeline with the docker profile instead of AWS Batch:
nextflow run Zymo-Research/aladdin-smrnaseq \
--design "<design CSV file>" \
--genome 'Homo_sapiens(GRCh38)' \
--protocol zymo \
-profile docker \
--outdir "<output dir on S3>" \
--name "my_analysis" \
-r "0.0.1"
For more options to customize your run, please refer to usage documentation.