I wrote this WDL script to perform whole-genome sequencing (WGS) variant calling on Google Cloud Platform (GCP) using the Genome Analysis Toolkit (GATK) and other tools in Docker containers. Here's a breakdown of its components and functionality:
The workflow fetches sequencing data from NCBI's Sequence Read Archive (SRA), converts it to an unaligned BAM file, aligns reads to a reference genome, performs quality control, marks duplicates, recalibrates base quality scores, calls variants with GATK HaplotypeCaller, filters the calls with GATK's convolutional neural network (CNN) scoring tools, and produces various outputs for downstream analysis.
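Concretely, each of those steps wraps a command-line tool call. Below is a simplified sketch of the command sequence, written as a plain Python driver rather than the actual WDL tasks; the accession, file names, and flags are placeholders, and the real workflow runs each step in its own container with many more options:

```python
import subprocess

# Placeholder accession, file names, and resources; each of these steps is a
# separate containerized WDL task in the real workflow.
steps = [
    "prefetch SRR000001",                                   # fetch run from SRA
    "fasterq-dump --split-files SRR000001",                 # SRA archive -> FASTQ
    "gatk FastqToSam --FASTQ SRR000001_1.fastq --FASTQ2 SRR000001_2.fastq"
    " --OUTPUT unmapped.bam --SAMPLE_NAME sample1",         # FASTQ -> unaligned BAM
    "bwa mem -t 4 ref.fasta SRR000001_1.fastq SRR000001_2.fastq"
    " | samtools sort -o aligned.bam -",                    # alignment + sort
    "gatk MarkDuplicates --INPUT aligned.bam --OUTPUT markdup.bam"
    " --METRICS_FILE duplicate_metrics.txt",                # duplicate marking
    "gatk BaseRecalibrator -I markdup.bam -R ref.fasta"
    " --known-sites dbsnp.vcf.gz --known-sites known_indels.vcf.gz"
    " -O recal.table",                                      # BQSR model
    "gatk ApplyBQSR -I markdup.bam -R ref.fasta"
    " --bqsr-recal-file recal.table -O recal.bam",          # apply recalibration
    "gatk HaplotypeCaller -I recal.bam -R ref.fasta -O raw.vcf.gz",
    "gatk CNNScoreVariants -V raw.vcf.gz -R ref.fasta -O scored.vcf.gz",
    "gatk FilterVariantTranches -V scored.vcf.gz --info-key CNN_1D"
    " --resource hapmap.vcf.gz --resource mills.vcf.gz -O filtered.vcf.gz",
]

for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)  # stop on the first failure
```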
The workflow takes the following inputs (typically assembled into a Cromwell inputs JSON, as sketched after this list):
- Human reference genome files (FASTA, .fai index, sequence dictionary, and the BWA index files: .alt, .bwt, .sa, .amb, .ann, .pac)
- Contamination sites files (.UD, .bed, .mu)
- SNP database (dbSNP_vcf, dbSNP_vcf_index)
- Known indel sites (VCFs and their indices)
- WGS calling interval list
- Resource VCFs and their indices (e.g., Mills, gnomAD)
- Other associated files for alignment and variant calling
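These inputs are normally supplied to Cromwell as a JSON file keyed by workflow input name. A minimal sketch, assuming a hypothetical workflow name `WgsVariantCalling` (the actual keys depend on the workflow's declared inputs) and using GATK resource-bundle file names only as examples:

```python
import json

# Hypothetical input keys and bucket paths -- adjust to the workflow's real
# input declarations and your own GCS locations.
inputs = {
    "WgsVariantCalling.ref_fasta": "gs://my-bucket/Homo_sapiens_assembly38.fasta",
    "WgsVariantCalling.ref_fasta_index": "gs://my-bucket/Homo_sapiens_assembly38.fasta.fai",
    "WgsVariantCalling.ref_dict": "gs://my-bucket/Homo_sapiens_assembly38.dict",
    "WgsVariantCalling.dbsnp_vcf": "gs://my-bucket/Homo_sapiens_assembly38.dbsnp138.vcf",
    "WgsVariantCalling.dbsnp_vcf_index": "gs://my-bucket/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
    "WgsVariantCalling.known_indels_vcfs": [
        "gs://my-bucket/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    ],
    "WgsVariantCalling.wgs_calling_interval_list": "gs://my-bucket/wgs_calling_regions.hg38.interval_list",
}

with open("inputs.json", "w") as fh:
    json.dump(inputs, fh, indent=2)
```

The workflow can then be launched with, e.g., `java -jar cromwell.jar run pipeline.wdl --inputs inputs.json`, or submitted to a Cromwell server configured with a GCP backend.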
It uses the following tools, each run from a pinned Docker image (see the sketch after this list):
- SRA Toolkit for fetching data from the SRA database
- BWA for read alignment
- GATK tools for alignment post-processing, variant calling, quality control, etc.
- SAMtools, Picard, and bedtools (used under the hood by the GATK tasks)
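In WDL, each task's runtime section pins one of these images, and the executor launches the task's command inside it; conceptually this is equivalent to a `docker run` like the one below. The image tag, mount paths, and the specific GATK invocation are illustrative only:

```python
import subprocess

# Illustrative equivalent of what the backend does for one task: run a pinned
# GATK image with the working directory mounted into the container.
cmd = [
    "docker", "run", "--rm",
    "-v", "/data:/data",                # expose task inputs/outputs
    "broadinstitute/gatk:4.2.6.1",      # a published GATK image; tag is an example
    "gatk", "ValidateSamFile",
    "-I", "/data/recal.bam",
    "--MODE", "SUMMARY",                # summary-level validation report
]
subprocess.run(cmd, check=True)
```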
The workflow generates various outputs, including:
- Unaligned BAM file from SRA data retrieval
- Aligned BAM files
- Marked duplicates metrics file
- Recalibration reports and files
- CRAM file from the final recalibrated BAM file
- VCF files: raw variants and CNN-filtered variants (parsed for the modeling below, as sketched after this list)
- Validation reports
- Sample contamination estimates
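For the modeling described next, the filtered VCF can be flattened into a per-variant feature table. A sketch using pysam, assuming the standard GATK site-level annotations and the CNN_1D score written by CNNScoreVariants are present in the INFO field:

```python
import pysam

# Site-level annotations assumed to be present; missing ones come back as None.
ANNOTATIONS = ["QD", "FS", "MQ", "SOR", "ReadPosRankSum", "MQRankSum", "CNN_1D"]

rows = []
with pysam.VariantFile("filtered.vcf.gz") as vcf:
    for rec in vcf:
        row = {name: rec.info.get(name) for name in ANNOTATIONS}
        row["chrom"], row["pos"], row["qual"] = rec.chrom, rec.pos, rec.qual
        # FilterVariantTranches leaves passing sites as PASS and labels the
        # rest with tranche-specific filter names.
        row["filtered_out"] = any(f != "PASS" for f in rec.filter.keys())
        rows.append(row)
```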
I then used the resulting variant calls to train machine learning models in Python with scikit-learn and LightGBM to detect artifactual variant calls, and compared their performance to GATK's 1D CNN.
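The comparison looked roughly like the following sketch: a couple of scikit-learn baselines plus a LightGBM model, evaluated by ROC AUC on a held-out split. The feature matrix and labels (`features.npy`, `labels.npy`) are placeholders for the annotation table and whatever truth labeling is available; hyperparameters are examples, not tuned values.

```python
import numpy as np
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and labels (1 = artifact, 0 = true variant).
X, y = np.load("features.npy"), np.load("labels.npy")
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "lightgbm": lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")

# The GATK baseline can be scored on the same split from its CNN_1D values
# (negated, since a higher CNN_1D score means the call is more likely real):
# roc_auc_score(y_te, -cnn_1d_scores_te)
```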