Wheat SNP Caller pipeline
Prof. Simon G. Krattinger, Professor of Plant Science
Center for Desert Agriculture,
4700 King Abdullah University of Science and Technology
Thuwal 23955-6900
Kingdom of Saudi Arabia
Nagarajan Kathiresan [email protected]
Hanin Ahmed [email protected]
Michael Abrouk [email protected]
Ibex is a heterogeneous cluster with a mix of AMD and Intel CPU nodes and NVIDIA GPU nodes of different architectures, which gives users a variety of options to work on. Overall, Ibex is made up of 320+ nodes, and the workload is managed by the SLURM scheduler. More information is available at https://www.hpc.kaust.edu.sa/ibex
Operating System on nodes: CentOS 7.9
Scheduler: SLURM version 20.11.8
The objective of this Wheat SNP Caller pipeline is to automate and optimize the various job steps across multiple samples. To simplify the pipeline for various project requirements, we separated it into two parts: (i) data processing and (ii) downstream analysis using GenotypeGVCFs.
We followed different steps for genome data processing (as part of the best-practices pipeline), which include (a) read trimming, (b) read mapping, (c) marking duplicates, (d) adding/replacing read groups, (e) haplotype calling, and (f) compressing and indexing the gVCF files. All steps in the data processing pipeline are automated based on job dependency conditions from the SLURM workload scheduler, and the automated scripts accept all the samples from the given INPUT file directory. Further, the software and/or the job steps can be modified based on the various requirements of the project. We selected the optimal number of cores for each job step based on our various case studies. This automated data processing script, called "workflow.sh", is available for your experiments, and the pipeline stages are shown in Figure (a) Pipeline steps in Data processing.
List of software
trimmomatic version 0.38
bwa version 0.7.17
samtools version 1.8
gatk version 4.1.8
tabix version 0.2.6
Figure (a) Pipeline steps in Data processing
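The per-sample dependency chaining described above can be sketched as a small helper function. This is a minimal illustration, not the actual contents of workflow.sh; the per-step sbatch script names (trim.sbatch, map.sbatch, ...) are hypothetical placeholders:

```shell
#!/bin/bash
# Sketch: chain the six data-processing steps for one sample using SLURM
# job dependencies. Step script names are hypothetical placeholders.
submit_chain() {
  local sample="$1" prev="" jid step
  for step in trim map markdup readgroups haplotypecaller compress; do
    if [ -z "$prev" ]; then
      jid=$(sbatch --parsable "${step}.sbatch" "$sample")
    else
      # afterok: start this step only if the previous one succeeded
      jid=$(sbatch --parsable --dependency="afterok:${prev}" "${step}.sbatch" "$sample")
    fi
    prev="$jid"
  done
  echo "$prev"  # job ID of the last step in the chain
}
```

Submitting the whole pipeline for every sample in the INPUT directory then reduces to a loop that calls submit_chain once per sample.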
The wheat genome is about 5x larger than the human genome. To process GenotypeGVCFs efficiently, every chromosome is split into multiple chunks based on the availability of computing resources; for example, the chromosome split intervals become smaller when a larger number of CPU cores is available. We further developed a load-balancing model for optimal chromosome splitting, which is useful because (a) it reduces the execution time and (b) the majority of the chunks complete at the same time. Figure (b) shows an example chromosome split table, which has four fields: (i) chromosome number, (ii) chunk number, (iii) start position, and (iv) end position. This chromosome split table provides disjoint chunks, and hence any GATK tool can be executed independently without any prerequisites.
Figure (b) Chromosome split intervals
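As a simple illustration of how such a split table can be produced, the sketch below divides one chromosome into near-equal disjoint 1-based intervals. The actual load-balancing model may produce unequal chunk sizes, and the chromosome name and length in the example are placeholders:

```shell
#!/bin/bash
# Sketch: split one chromosome into N disjoint 1-based intervals.
# Output columns match the split table: chromosome, chunk, start, end.
split_chromosome() {
  local chrom="$1" length="$2" nchunks="$3"
  local size=$(( (length + nchunks - 1) / nchunks ))  # ceiling division
  local i start end
  for (( i = 0; i < nchunks; i++ )); do
    start=$(( i * size + 1 ))
    end=$(( (i + 1) * size ))
    (( end > length )) && end="$length"   # clamp the last chunk
    printf '%s\t%d\t%d\t%d\n' "$chrom" $(( i + 1 )) "$start" "$end"
  done
}
# Example (placeholder length): split_chromosome chr1A 594102056 8
```

Because consecutive chunks share no positions, the resulting intervals can be handed to independent jobs without any prerequisites between them.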
Based on the chromosome split table, we developed an automated downstream analysis workflow, described in Figure (c). Each instance of the downstream analysis workflow is executed on a disjoint chunk interval. Once all the chunks have been processed, GATK MergeVcfs combines all the intervals in the same order as the chromosome split.
Figure (c) Downstream analysis
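A minimal sketch of the per-chunk step is shown below. It assumes the combined gVCFs are read from a GenomicsDB workspace; the reference, workspace, and output paths are all hypothetical placeholders, not the pipeline's actual file names:

```shell
#!/bin/bash
# Sketch: build the GenotypeGVCFs command for one disjoint chunk taken
# from the split table. All file paths are hypothetical placeholders.
genotype_chunk_cmd() {
  local chrom="$1" chunk="$2" start="$3" end="$4"
  echo "gatk GenotypeGVCFs" \
       "-R reference.fasta" \
       "-V gendb://wheat_genomicsdb" \
       "-L ${chrom}:${start}-${end}" \
       "-O ${chrom}.chunk${chunk}.vcf.gz"
}
# Because the chunks are disjoint, each command can run as an independent
# SLURM job. Afterwards the per-chunk VCFs are merged in split-table order:
#   gatk MergeVcfs -I chr1A.chunk1.vcf.gz -I chr1A.chunk2.vcf.gz -O chr1A.vcf.gz
```

The merge must follow the split-table order so that the combined VCF stays position-sorted within each chromosome.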