Wheat SNP Caller pipeline
Prof. Simon G. Krattinger, Professor of Plant Science
Center for Desert Agriculture,
4700 King Abdullah University of Science and Technology
Thuwal 23955-6900
Kingdom of Saudi Arabia
Nagarajan Kathiresan [email protected]
Hanin Ahmed [email protected]
Michael Abrouk [email protected]
Ibex is a heterogeneous cluster with a mix of AMD and Intel CPU nodes and NVIDIA GPU nodes of different architectures, which gives users a variety of options to work on. Overall, Ibex is made up of 320+ nodes, and the workload is managed by the SLURM scheduler. More information is available at https://www.hpc.kaust.edu.sa/ibex
Operating System on nodes: CentOS 7.9
Scheduler: SLURM version 20.11.8
The objective of this Wheat SNP Caller pipeline is to automate and optimize the various job steps across multiple samples. To simplify the pipeline for various project requirements, we separated it into two parts: (i) data processing and (ii) downstream analysis using GenotypeGVCFs.
We followed different steps for genome data processing (as part of the best-practices pipeline), which include (a) read trimming, (b) read mapping, (c) marking duplicates, (d) adding/replacing read groups, (e) haplotype calling, and (f) compressing and indexing the gVCF files. All steps in the data processing pipeline are automated based on job dependency conditions from the SLURM workload scheduler, and the automated scripts accept all the samples from the given INPUT file directory. Further, the software and/or the job steps can be modified based on the various requirements of the project. We selected the optimal number of cores for each job step based on our various case studies. This automated data processing script, called "workflow.sh", is available for your experiments, and the pipeline stages are shown in Figure (a) Pipeline steps in Data processing.
List of software
trimmomatic version 0.38
bwa version 0.7.17
samtools version 1.8
gatk version 4.1.8
tabix version 0.2.6
Figure (a) Pipeline steps in Data processing
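The per-sample dependency chaining described above can be sketched as a small helper function. This is a minimal illustration, not the actual contents of workflow.sh; the per-step sbatch script names (trim.sbatch, map.sbatch, ...) are hypothetical placeholders:

```shell
#!/bin/bash
# Sketch: chain the six data-processing steps for one sample using SLURM
# job dependencies. Step script names are hypothetical placeholders.
submit_chain() {
  local sample="$1" prev="" jid step
  for step in trim map markdup readgroups haplotypecaller compress; do
    if [ -z "$prev" ]; then
      jid=$(sbatch --parsable "${step}.sbatch" "$sample")
    else
      # afterok: start this step only if the previous one succeeded
      jid=$(sbatch --parsable --dependency="afterok:${prev}" "${step}.sbatch" "$sample")
    fi
    prev="$jid"
  done
  echo "$prev"  # job ID of the last step in the chain
}
```

Submitting the whole pipeline for every sample in the INPUT directory then reduces to a loop that calls submit_chain once per sample.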
The wheat genome is about 5x larger than the human genome. To process GenotypeGVCFs efficiently, every chromosome is split into multiple chunks based on the availability of computing resources; for example, the chromosome split intervals become smaller when a larger number of CPU cores is available. We further developed a load-balancing model for optimal chromosome splitting, which is useful because (a) it reduces the execution time and (b) the majority of the chunks complete at the same time. Figure (b) shows an example chromosome split table, which has four fields: (i) chromosome number, (ii) chunk number, (iii) start position, and (iv) end position. This chromosome split table provides disjoint chunks, and hence any GATK tool can be executed independently without any prerequisites.
Figure (b) Chromosome split intervals
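As a simple illustration of how such a split table can be produced, the sketch below divides one chromosome into near-equal disjoint 1-based intervals. The actual load-balancing model may produce unequal chunk sizes, and the chromosome name and length in the example are placeholders:

```shell
#!/bin/bash
# Sketch: split one chromosome into N disjoint 1-based intervals.
# Output columns match the split table: chromosome, chunk, start, end.
split_chromosome() {
  local chrom="$1" length="$2" nchunks="$3"
  local size=$(( (length + nchunks - 1) / nchunks ))  # ceiling division
  local i start end
  for (( i = 0; i < nchunks; i++ )); do
    start=$(( i * size + 1 ))
    end=$(( (i + 1) * size ))
    (( end > length )) && end="$length"   # clamp the last chunk
    printf '%s\t%d\t%d\t%d\n' "$chrom" $(( i + 1 )) "$start" "$end"
  done
}
# Example (placeholder length): split_chromosome chr1A 594102056 8
```

Because consecutive chunks share no positions, the resulting intervals can be handed to independent jobs without any prerequisites between them.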
Based on the chromosome split table, we developed an automated downstream analysis workflow, described in Figure (c). Each instance of the downstream analysis workflow is executed on a disjoint chunk interval. Once all the chunks have been processed, GATK MergeVcfs combines all the intervals in the same order as the chromosome split.
Figure (c) Downstream analysis
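A minimal sketch of the per-chunk step is shown below. It assumes the combined gVCFs are read from a GenomicsDB workspace; the reference, workspace, and output paths are all hypothetical placeholders, not the pipeline's actual file names:

```shell
#!/bin/bash
# Sketch: build the GenotypeGVCFs command for one disjoint chunk taken
# from the split table. All file paths are hypothetical placeholders.
genotype_chunk_cmd() {
  local chrom="$1" chunk="$2" start="$3" end="$4"
  echo "gatk GenotypeGVCFs" \
       "-R reference.fasta" \
       "-V gendb://wheat_genomicsdb" \
       "-L ${chrom}:${start}-${end}" \
       "-O ${chrom}.chunk${chunk}.vcf.gz"
}
# Because the chunks are disjoint, each command can run as an independent
# SLURM job. Afterwards the per-chunk VCFs are merged in split-table order:
#   gatk MergeVcfs -I chr1A.chunk1.vcf.gz -I chr1A.chunk2.vcf.gz -O chr1A.vcf.gz
```

The merge must follow the split-table order so that the combined VCF stays position-sorted within each chromosome.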