CompleteWGS (cWGS)

This is a pipline the enables the mapping, variant calling, and phasing of input fastq files from a PCR free (PF) and a Complete Genomics' DNBSEQ Complete WGS (cWGS) (a DNA cobarcoding technology, previously known as stLFR) library of the same sample. Running this pipeline results in a highly accurate and complete phased vcf. We recommend at least 40X depth for the PCR free library and 30X depth for the cWGS library. Below is a flow chart which summarizes the pipeline processes. *Note, SV detection has not yet been enabled on the current version of the pipeline.

Requirements

Hardware requirements
Multiple core computer (default >=48CPU)
Minium 72GB RAM
Exact storage may vary depending on sample count and coverage, expect 1TB per sample.
Software requirements
Linux CentOS >=7
You may need root access to install Singularity (Singularity is a more secure container platform as it does not require root access on execution, while Docker does.)

Installation

On a Linux server, install singularity >= 3.8.1 with root on every nodes.
Download the singularity images (internet connection required) by the following commands:

cat <<EOF > CWGS.def
Bootstrap: docker
From: stlfr/cwgs:1.0.6
%post
    cp /90-environment.sh /.singularity.d/env/
EOF

singularity build --fakeroot CWGS.sif CWGS.def

The users without root permission may build the .sif file locally then upload to server.
If the singularity doesn't support --fakeroot, you need sudo permission to run this command: sudo singularity build CWGS.sif CWGS.def

singularity exec -B`pwd -P` --pwd `pwd -P` CWGS.sif cp -rL /usr/local/bin/CWGS /usr/local/bin/runit /usr/local/app/CWGS/PARAMS.txt /usr/local/app/CWGS/demo .

Download the database (internet connection required) by this command:

./CWGS -createdb

Or for MegaBolt or ZBolt nodes ((MGI's Bioinformatics analysis accelerator, including MegaBOLT/ZBOLT/ZBOLT Pro)

./CWGS -createdb --megabolt

This command will download around 32G data from internet and build index locally, which will occupy another 30G storage.

Test demo data:

cat << EOF > samplelist.txt
sample  stlfr1                      stlfr2                      pcrfree1                 pcrfree2
demo    demo/stLFR_demo_1M_1.fq.gz  demo/stLFR_demo_1M_2.fq.gz  demo/PF_demo_1M_1.fq.gz demo/PF_demo_1M_2.fq.gz
EOF
 
./CWGS samplelist.txt -local

Test demo data on clusters by SGE (Sun Grid Engine):

./CWGS samplelist.txt --queue mgi.q --project none

Test demo data on clusters by SGE (Sun Grid Engine) with MegaBolt/ZBolt nodes:

./CWGS samplelist.txt -bolt --queue mgi.q --project none --boltq fpga.q

Run the pipeline

Note that the order of parameters matters: single dash parameters (-opt) should be placed before all double dash parameters (--opt)

Generate sample.list. start from raw fastq (cWGS/stLFR PE100, R2 is 100+42bp barcode e.g.) files (default) E.g.
```
cat << EOF > sample.list
sample	stlfr1	stlfr2	pcrfree1	pcrfree2
demo1	/path/to/cWGS_01_1.fq.gz	/path/to/cWGS_01_2.fq.gz	/path/to/PCRfree_01_1.fq.gz	/path/to/PCRfree_01_2.fq.gz
demo2	/path/to/cWGS_02_1.fq.gz	/path/to/cWGS_02_2.fq.gz	/path/to/PCRfree_02_1.fq.gz	/path/to/PCRfree_02_2.fq.gz
EOF
```
*paths above can be both absolute and relative

start from barcode (BC) split/deconvolution fastq files (cWGS/stLFR PE100, R2 is 100bp e.g.) (set --skipBarcodeSplit true) format same as above.

start from PCR-free and cWGS bam files (set --frombam true) E.g.
```
cat << EOF > sample.list
sample	stlfrbam	pfbam
demo1	/path/to/cWGS_01.bam	/path/to/PCRfree_01.bam
demo2	/path/to/cWGS_02.bam	/path/to/PCRfree_02.bam
EOF
```
Updates are pushed to the github module and script folders, use the latest ones. Currently, it's recommended to remove PF reads of MAPQ<3 with the --pfmapq tag. Also, to customize and make the pipeline adapt to your needs, you may revise the scripts. An example run:
```
./CWGS sample.list -sing /usr/local/bin/singularity -module <module_path> -script <script_path> -local -debug --use_megabolt false --pfmapq 3
```

Run settings Set CPU

--cpu2 INT
  Specify cpu number for QC, markdup, bam downsample, merge bam, bam stats calculation. [24]

--cpu3 INT
  Specify cpu number for alignment and short variants calling (including BQSR and VQSR, if specified). [48]

Sample the input fastq files

--sampleFq BOOL
  if you want to initially sample the fastq file, set it true. [false]
the following settings are valid only sampleFq is true
--stLFR_fq_cov INT [only valid when '--sampleFq true']
sample cWGS (stLFR) reads to this coverage [40] 

--PF_fq_cov INT [only valid when '--sampleFq true']
sample PCR-free reads to this coverage [50]

Alignment (BWA or Lariat) and variant calling (GATK or Deepvariant (DV)) relevant settings

--align_tool STRING
  Specify the alignment tool for cWGS reads. [lariat]
  Supports:
    bwa
    lariat
    bwa,lariat (this will execute both)

--var_tool STRING
  Specify the variant calling tools for merged bam file. [dv]
  Supports:
    gatk
    dv (DeepVariant)
    gatk,dv (this will execute both)

*If two alignment tools ("lariat,bwa") and two variant calling programs ("gatk,dv") are specified, four result sets will be generated.

--gatk_version STRING [only valid when '--var_tool' contains "gatk"]
  Specify the GATK version. [v4]
  Supports:
    v3
    v4

--run_bqsr BOOL [only valid when '--var_tool' contains "gatk"]
  Run Base Quality Score Recalibration (BQSR) of GATK. [true]

--run_vqsr BOOL [only valid when '--var_tool' contains "gatk"]
  Run Variant Quality Score Recalibration (VQSR) of GATK. [true]

--split_by_intervals BOOL [only valid when '--var_tool' contains "gatk" and '--use_megabolt' is false]
  Utilizes -L option for GATK haplotypecaller; split by chromosome. [true]

--dv_version STRING [only valid when '--var_tool' contains "dv"]
  Specify the DeepVariant version. [v1.6]
  Supports: 
    v1.6
    v0.7
  Current MegaBOLT DeepVariant version is v0.7; therefore, if you specify this option to "v1.6", MegaBOLT will not be used even if '--use_megabolt' is true.

Markdup

--markdup STRING
Specify the mark duplicates tool. [biobambam2]
Supports: 
  biobambam2 (much faster than picard; recommended)
  picard
  gatk4 (MarkDuplicatesSpark)
  sambamba (not recommended)

Downsample the bam file

--sampleBam BOOL
  Whether downsample the cWGS bam and PCR-free bam. [true]

--stLFR_sampling_cov INT [only valid when '--sampleBam true']
  Downsample cWGS (stLFR) bam to the specified coverage. [30]

--PF_sampling_cov INT [only valid when '--sampleBam true']
  Downsample PCRFree bam to the specified coverage. [40]

Merge the bam

--PF_lt_stLFR_depth INT
  Extract the intersection regions from the sampled cWGS (stLFR) bam with depth greater than (>) this value and PCRFree bam with depth less equal than (<=) this value. [10]

Enable resuming the running

--keepFiles BOOL
  By default, useless intermediate files will be deleted during the analysis to save storage. If you want to resume the run, set it true. [false]

Debug mode (if you plan to rerun with different parameter setting to tune the results, use -debug)

-debug
By default, each process only keeps the output files. If you want to check the intermediate files within a process, use this flag.

Executor and MegaBOLT setting, four combinations: Make sure CWGS is in your PATH.
1. on clusters by SGE (Sun Grid Engine) and no MegaBOLT (default) Confirm the working queue and project number, which can be specified using --queue, and --project for regular queue, and project id, respectively. Use "--project none" if the system doesn't support a project id. E.g.
```
CWGS sample.list --queue all.q --project none > run.log 2>&1 &
```
2. on clusters by SGE with MegaBOLT nodes. Ensure the clusters contain at least one MegaBOLT queue and have a queue for them, e.g. bolt.q. Confirm the working queue and project number, which can be specified using --queue, --boltq, and --project for regular queue, MegaBOLT queue, and project id, respectively. Use "--project none" if the system doesn't support a project id. E.g.
```
CWGS sample.list -bolt --queue all.q --boltq bolt.q --project none > run.log 2>&1 &
```
3. locally run. Run with "-local" option. E.g.
```
CWGS sample.list -local > run.log 2>&1 &
```
4. locally run on a MegaBOLT machine. Run with "-local" & "-bolt" option. E.g.
```
CWGS sample.list -local -bolt > run.log 2>&1 &
```

A more detailed flow chart.

Output of the demo example

Results
All output in the ./CWGS_run folder.

The ./CWGS_run/out/report.csv is a summary report, with all intermediate metrics, results of mapping, variant calling, phasing etc.
FQ, BAM, VCF output
The FQs are in ./CWGS_run/out/<sample_name>/fq, QC by SOAPnuke.
(demo_split_*.fq.gz are the FQ after barcode deconvolution)

demo.pf.bssq
demo.pf.qc_1.fq.gz
demo.pf.qc_2.fq.gz
demo_split_1.fq.gz
demo_split_2.fq.gz
split_stat_read1.log

The BAMs in in ./CWGS_run/out/<sample_name>/align

06.lfr_highquality.txt
06.lfr_length.txt
06.lfr_per_barcode.txt
06.lfr_readpair.txt
demo.cmrg.cov
demo.lariat.cov10.intersect.bed
demo.lariat.dv.vcf.gz
demo.lariat.dv.vcf.gz.tbi
demo.lariat.merge.bam
demo.lariat.merge.bam.bai
demo.lfr.report
demo.merge.cmrg.hist.bed
demo.merge.cmrg.mean.bed
demo.pf.bwa.dv.vcf.gz
demo.pf.bwa.dv.vcf.gz.tbi
demo.pf.cmrg.hist.bed
demo.pf.cmrg.mean.bed
demo.pf.megaboltbwabqsr.bam
demo.pf.megaboltbwabqsr.bam.bai
demo.stlfr.lariat.biobambam2.bam
demo.stlfr.lariat.biobambam2.bam.bai

The phased VCF in in ./CWGS_run/out/<sample_name>/phase

demo.lariat.dv.hapblock
demo.lariat.dv.hapcut_stat.txt
demo.lariat.dv.phase.report
demo.lariat.dv.phased.vcf.gz
demo.lariat.dv.phased.vcf.gz.tbi

Log file

The run.log shows excution information etc.
For a typical run of 1 sample, with 30x cWGS and 40x PCR free library, with 60CPU:
MegaBOLT: ~14hr
non-MegaBOLT: ~45hr
(Run time can be reduced by a batch run of multiple N samples, total time <= N*time_per_sample)
The ./CWGS_run/report.html is output of nextflow, with runtime, CPU usage etc.
The ./CWGS_run/trace.txt shows excution of each steps, use trace.txt to find intermediate files/folders.

Reference

Lariat
A Linked-Read Alignment Tool
Deepvariant
A deep learning-based variant caller
Hapcut2
A haplotype assembly tool
SOAPnuke
A novel quality control tool
MegaBOLT
A Bioinformatics analysis accelerator
cWGS/stLFR
A DNA cobarcoding technique

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
CWGS_run		CWGS_run
images		images
modules		modules
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CompleteWGS (cWGS)

Requirements

Installation

Run the pipeline

Output of the demo example

Reference

About

Releases

Packages

Contributors 3

Languages

License

CGI-stLFR/CompleteWGS

Folders and files

Latest commit

History

Repository files navigation

CompleteWGS (cWGS)

Requirements

Installation

Run the pipeline

Output of the demo example

Reference

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages