Skip to content

A snakemake workflow for WGS-based tuberculosis transmission analysis

License

Notifications You must be signed in to change notification settings

cvn001/transflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

💎 TransFlow: Tuberculosis Transmission Analysis Workflow

🧉 Introduction

TransFlow is a modular, flexible and user-friendly tuberculosis (TB) transmission analysis workflow based on whole genome sequencing (WGS) of Mycobacterium tuberculosis complex (MTBC). It should be noted that this workflow is specially designed for the TB transmission analysis. For other researches, like MTBC strain typing or drug-resistance prediction, please refer to other state-of-the-art tools, such as SAM-TB, TB-profiler and Mykrobe.

The workflow filters non-MTBC samples using Kraken, then preforms quality control (QC) using both FastQC and MultiQC. After that, it uses the PANPASCO workflow to do pan-genome mapping and relative pairwise SNP distance calculation for transmission analysis. Next, it infers transmission clusters and networks using transcluster and SeqTrack, separately. Notably, it provides two different clustering methods, one is SNP-threshold method (as default) and the other is transmission method. Finally, it detects risk factors that significantly associate with transmission clustering.

DOI


🐍 Workflow

Workflow


Input and output

Input format

  • Raw reads: Currently, TransFlow supports paired-end WGS reads only from Illumina sequencing systems, such as Illumina Miseq platform in 301bp paired-end format (*_1.fastq.gz, *_2.fastq.gz file).
  • Metadata: Data that provide information about epidemiological characteristics of each sample. For example, age, gender and geographical coordinate of each patient, and lineage, drug resistance of each strain (tab-delimited .txt file).

Note: For transmission network inference, the collection dates of all samples are required. The dates should be formatted as year-month-day format, e.g. "YYYY-MM-DD", "YYYY/MM/DD", "YYYY-MM", "YYYY/MM" or "YYYY". The risk factor analysis module solely relies on the epidemiological feature data provided by the user in the metadata file, the columns of which correspond to feature names, and the user can decide which feature would be analyzed in the config parameter. Missing data is allowed in the feature input, and these missing values ​​are ignored during the test.

Output directories

  • 1.Quality_control: Quality control of sequencing reads before and after the quality trimming and adapter removal step of the workflow by using FastQC.
  • 2.MTBC_identification: The percentage of reads mapping to MTBC and the category detected by kraken.
  • 3.SNP_calling: All the results regarding reads alignment and SNP calling by GATK are collected in this directory.
  • 4.SNP_distance: The directory contains the statistics and visualizations of pairwise SNP distances between genomes from PANPASCO algorithm.
  • 5.Transmission_cluster: The directory where all transmission analysis (SNP-based or transmission-based) results are placed including clustering statistics and transmission network.
  • 6.Risk_factor: The directory contains the statistical testing result of transmission risk factors.
  • 7.Summary_report: The directory contains a detailed and interactive summary report (html format) of all analyses, which can be seen by web browser.
  • temp: This is a negligible directory containing all temporary files from different modules. It can be totally deleted after the program has finished running to save storage space.

⚙️ Installation

Dependencies

All dependencies are list on workflow/envs/transflow.yaml and can be automatically downloaded and installed with conda as below, otherwise you must have these in your PATH.

  • softwares: snakemake (v6.6.0+), fastqc (v0.11.9+), multiqc (v1.11+), qualimap (v2.2.2d+), samtools (v1.12+), gatk (v3.8), picard (v2.18+), bedtools (v 2.27+), flash (v1.2.11+), seqtk (v1.3+), fastp (v0.23+), trimmomatic (v0.36+), bwa (v0.7.17), tabix (v1.11+), pandoc (v2.16+), phantomjs (v2.1.1+), kraken (v1.1.1)
  • Python (v3.6+)
    • Packages: numpy, matplotlib, pandas, biopython
  • R (v4.0+)
    • Packages: argparse, ggpubr, ggplot2, devtools, ggally, lubridate, adegenet, network, pheatmap, gridextra, sna, scales, ggsci, yaml, kableextra, data.table, dt, ggsummary, tidyverse, genomicranges, transcluster

Note: Version 3.8 of GATK is needed, as some functionalities of GATK3 were not ported to GATK4 yet.

Installation with Conda

Conda can function as a package manager and is available here.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

If you have conda make sure the bioconda and conda-forge channels are added:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Clone the repository

git clone [email protected]:cvn001/transflow.git
cd transflow

Create the environment

After cloning this repository to a folder of your choice, it is recommended to create a general transflow conda environment with the accompanying workflow/envs/transflow.yaml. In the main folder of the cloned repository, execute the following command:

conda env create -f workflow/envs/transflow.yaml

This will create a conda environment containing all dependencies for Snakemake itself.

Note: It is advised to use mamba to speed up your conda installs.

mamba env create -f workflow/envs/transflow.yaml

Activate the environment

conda activate transflow

Install the transcluster in R

devtools::install_github("JamesStimson/transcluster")

Download MiniKraken database

A pre-built 8 GB Kraken database MiniKraken DB_8GB for Kraken V1 is the suggested reference database for TransFlow. It is constructed from complete bacterial, archaeal, and viral genomes in RefSeq. According to the website, the latest version of MniKraken is 10/19/2017.


⚙️ Set up configuration

To run the complete workflow do the following:

  • Place all gzipped FASTQ files of your samples (ID_1.fastq.gz, ID_2.fastq.gz) into raw_data/. Alternatively, you can specify the location of your gzipped FASTQ files in config/configfile.yaml.
  • Replace the example metadata file (config/samples.txt) with your own sample sheet containing:
    • A row for each sample
    • The following columns for each row:
      • (A) sample (id of each sample) [Required]
      • (B) date (collection date of each sample) [Required]
      • (C) latitude (latitude of each sample) [Optional]
      • (D) longitude (longitude of each sample) [Optional]
      • (...) (Other epidemiological characters of each sample for risk factor inference) [Optional]
  • Modify the example configure file (config/configfile.yaml) to specify the appropriate:
    • metadata_file to /path/to/metadata_file
    • kraken_db to /path/to/minikraken_20171019_8GB
    • characteristics to collected characters, leave blank for skipping the risk factor analysis module.

🔧 Quick start

For testing the whole workflow, we provide an example dataset including 10 artificial samples (S1 to S10) with paired-end WGS and epidemiological data (age, gender, previous treatment) in example directory.

Automatic usage

After setting up the configure file in config directory, you can run whole pipeline in just one command:

# usual command-line
bash transflow.sh --configfile config/configfile.yaml --cores 4

The result summary report is also included in the example directory.

Useful snakemake parameters

Short Option Long Option Explanation
-k --keep If a job fails, continue with independent jobs.
-p --printshellcmds Print out the shell commands that will be executed.
-r --reason Print the reason for rule execution (Missing output, updated input etc.)
-c --cores Number of CPU cores (threads) to use for this run. With no int, it uses all.
--ri --rerun-incomplete If Snakemake marked a file as incomplete after a crash, delete and produce it again.
-n --dryrun Just pretend to run the workflow. A similar option is -S (--summary).
-q --quiet Do not output certain information. If used without arguments, do not output any progress or rule information.
--cleanup-shadow Cleanup old shadow directories which have not been deleted due to failures or power loss.
--verbose Print detailed stack traces and detailed operations. Default is False.
--nocolor Do not use a colored output. Default is False.

🔧 Step by step

The process can also run the main modules step by step, which is convenient for users to adjust some parameters. If you want to rerun this step, just delete the specified file and execute the command again. Below is a detailed description.

1. Data preprocessing and quality control

TransFlow starts with performing QC of the raw FASTQ files using FastQC. Trimming is performed using the tool fastp and subsequently, an additional QC report is generated. A single summary QC report across all samples is presented. Besides, it is also important to filter out samples that may have been significantly contaminated by foreign DNA during sample preparation. For this purpose the pair-end reads of each sample are classified through Kraken.

snakemake -s workflow/quality_control.snakefile --configfile config/configfile.yaml --cores 4

2. Reads mapping and variant calling

When the user is satisfied with the quality of the reads, the workflow proceeds to the next module: reads mapping and variant calling. The reference MTBC pangenome is provided in resources directory. The aligner and variant caller used in TransFlow is BWA-MEM and GATK3, respectively. All default parameters in this step is consistent with the mainstream MTB WGS researches and can be adjust in the configure file.

snakemake -s workflow/variant_calling.snakefile --configfile config/configfile.yaml --cores 4

3. Transmission analysis

This module is consist of four parts: pairwise distance matrix calculation, transmission clustering, transmission network reconstruction, and risk factor inference. The workflow adopts the PANPASCO algorithm for pairwise SNP distance calculation. Next, TransFlow provides two different methods for transmission clustering, SNP-based (default in configure file) and transmission-based method. Users can try both methods by setting the parameter and re-run this module, respectively. Likewise, users can try different SNP or transmission thresholds in the configure file. For the clusters with more than three samples, their transmission networks will be reconstructed using SeqTrack method. Finally, TransFlow implements risk factor inference for epidemiological characteristics.

snakemake -s workflow/transmission_analysis.snakefile --configfile config/configfile.yaml --cores 4

4. Generating summary report

After all of above steps, this module can be used to generate a visualization HTML report containing various tables and plots.

snakemake -s workflow/report_generating.snakefile --configfile config/configfile.yaml --cores 4

🏗️ Parameters in the configure file

These parameters are set in the configuration files. Please read them carefully before using TransFlow.

Parameter Description Default Note
genome_file Reference genome file provided in this repository User can generate this using seq-seq-pan
genomegaps_file File in .bed format with gaps in WGA of pan-genome provided in this repository For more details refer to seq-seq-pan
exclude_regions_file File in .bed format with positions that should be excluded from distance analysis provided in this repository It includes positions regarding to drug-resistance associated genes or repetitive regions of reference pan-genome (e.g. PPE/PE-PGRS family genes, phage sequence, insertion or mobile genetic elements). SNPs in these regions will be excluded for pairwise distance calculation
metadata_file File with list of samples with one ID per line - Missing values will be omitted
fastqdir Directory with .fastq.gz files named ID_1.fastq.gz, ID_2.fastq.gz fastq Only support Illumina PE seq data
fastqpostfix Specification of fastq.gz format; e.g. for the format sample_R1.fastq.gz put fastqpostfix: R - Don't mix different postfixes
glob_files [true or false], true: the workflow will load all fastq files in the input directory and parse all the sample names. false: the workflow will only read the fastq files of the samples in the metadata file false
kraken_cutoff Threshold of MTBC reads percentage 90 This value is aim at filtering highly contaminated samples
MTBC_reads_only Filter out only MTBC reads false This would allow for slightly contaminated samples to still be reliably processed
base_quality the quality value that a base is qualified 15 default 15 means phred quality >=Q15 is qualified
read_minimum_length Minimum length of reads to be kept after trimming 50
flash_overlap Number of overlapping base pairs of paired reads for FLASH to merge reads 10
allele_frequency_threshold Allele frequency threshold for definition of high-quality SNPs 0.75
mapping_quality_threshold Minimum mapping quality for reads to be used with GATK HaplotypeCaller 10 Keeping more low quality regions for pairwise comparison
depth_threshold Minimum coverage depth for high-quality regions 5 Keeping more low coverage regions for pairwise comparison
output_prefix Prefix for all distance files all_samples Using the project name is suggested for benefiting your management
method Transmission clustering method [SNP or trans] SNP You can try both methods by setting each separately
snp_threshold SNP distance threshold for transmission clustering 12 Initially, a maximum distance of 12 SNPs between MTB isolates was introduced to rule in a possible epidemiological link between TB cases
transmission_threshold The threshold for transmission clustering 15
clock_rate Clock rate for MTBC samples (SNPs/genome/year) 0.5 Whilst the background SNP accumulation rate for MTB has been estimated at 0.5 SNPs/genome/year, selection pressure and antibiotic resistance can influence this rate considerably.
transmission_rate The rate at which the estimated number of intermediate transmissions must be 1.2
coordinate Using sample's coordinate to improve transmission network reconstruction [true or false] true
characteristics Epidemiological characteristics for risk factor inference - All characters should be separated by spaces. Leave it blank to skip this step
sample_threads Number of threads for each sample run 1

📲 Troubleshoot

Common issues

For contacting the developer and issue reports please go to Issues.

Some samples are not in the transmission results

There are a few steps where sequences can be removed:

During the filter step:

  1. Samples that fail the current MTBC filtering criteria, as defined in the configfile.yaml file, are removed. You can modify the threshold as desired.
  2. No ambiguity in (sample collection) date.

Java on Linux insufficient memory even though there is plenty of available memory being used for caching

This is due to the kernel.pid_max limit of the OS. To solve this error, please refer to this link. To check and tuning the limit: link

FastQC Exception in thread "Thread-1" (error)

This means that FastQC may have run out of memory. Please try to increase the sample_threads parameter in the configure file.

Unlocking

After the workflow was killed (Snakemake didn’t shutdown), the workflow directory will be still locked. If you are sure, that snakemake is no longer running (ps aux | grep snake).

Unlock and clean up the working directory:

snakemake *.snakemake --unlock --cleanup-shadow

Rerun incomplete

If Snakemake marked a file as incomplete after a crash, delete and produce it again.

snakemake *.snakemake --ri

🍾 License

The code is available under the GNU GPLv3 license. The text and data are available under the CC-BY license.