Setting up the config.yaml file

Richard de Borja edited this page Oct 7, 2020 · 2 revisions

A configuration file is used to set up parameters for the QC pipeline. These parameters customize a QC run: they specify the location of files and directories, the platform used to sequence the samples, and application parameters.

Example config.yaml file

# path to the top-level directory containing the analysis results
data_root: run_200430

# optional "run name" prefix for the plots; if this is not defined, the prefix will be "default"
run_name: my_run

# path to the file containing the amplicon regions (not the primer sites, the actual amplicons)
amplicon_bed: resources/artic_amplicons.bed

# path to the nCov reference genome
reference_genome: resources/artic_reference.fasta

# the sequencing platform used, must be "oxford-nanopore" or "illumina"
platform: "oxford-nanopore"

# path to the BED file containing the primers, this should follow the format downloaded from
# the ARTIC primer
primer_bed: nCoV-2019.bed

# list the type of amplicon BED file that will be created from the "primer_bed".  This can include:
# full -- amplicons including primers and overlaps listed in the primer BED file
# no_primers -- amplicons including overlaps but with primers removed
# unique_amplicons -- distinct amplicon regions with primers and overlapping regions removed
bed_type: unique_amplicons

# offset for the amplicons and primers
offset: 0

# minimum completeness threshold for inclusion in the SNP tree plot; if no entry
# is provided, the default is 0.75
completeness_threshold: 0.9
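The defaults and allowed values described in the comments above can be checked before a run. The following is a minimal sketch, not part of the pipeline: `validate_config` and the two constant sets are hypothetical names, and the sketch assumes the YAML file has already been parsed into a Python dictionary (as Snakemake does for its `config` object) and that `bed_type` holds a single value.

```python
# Hypothetical helper, not provided by the pipeline: applies the defaults and
# allowed values documented in the example config.yaml comments.
VALID_PLATFORMS = {"oxford-nanopore", "illumina"}
VALID_BED_TYPES = {"full", "no_primers", "unique_amplicons"}


def validate_config(config):
    """Return a copy of the parsed config with documented defaults applied."""
    checked = dict(config)

    # Defaults stated in the config comments.
    checked.setdefault("run_name", "default")
    checked.setdefault("completeness_threshold", 0.75)

    # The platform must be one of the two documented values.
    platform = checked.get("platform")
    if platform not in VALID_PLATFORMS:
        raise ValueError(
            f"platform must be one of {sorted(VALID_PLATFORMS)}, got {platform!r}"
        )

    # Assumption: bed_type is a single string; it is only checked when present.
    bed_type = checked.get("bed_type")
    if bed_type is not None and bed_type not in VALID_BED_TYPES:
        raise ValueError(
            f"bed_type must be one of {sorted(VALID_BED_TYPES)}, got {bed_type!r}"
        )

    return checked


config = validate_config({"data_root": "run_200430", "platform": "oxford-nanopore"})
print(config["run_name"], config["completeness_threshold"])
```

Returning a copy rather than modifying the input keeps the parsed file contents available for comparison if a default was applied unexpectedly.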

Optional configuration parameters:

# full path to the metadata file
metadata: /path/to/metadata.tsv

# full path to the ANNOVAR SARS-CoV-2 database files for annotating variants
sarscov2db: /path/to/annovar/db/sarscov2/files

The config file is YAML-formatted and supports Python- and Snakemake-style wildcards, which allow custom filenames. The pipeline is designed to work with output generated by iVar (Illumina) or the artic-ncov2019/fieldbioinformatics workflow (Oxford Nanopore). If the platform key is set to either oxford-nanopore or illumina, filename patterns default to that pipeline's output naming; these defaults can always be overridden:

# the naming convention for the bam files
# this can use the variables {data_root} (as above) and {sample}
# As per the example above, this will expand to run_200430/sampleA.sorted.bam for sampleA
bam_pattern: "{data_root}/{sample}.sorted.bam"

# the naming convention for the consensus sequences
consensus_pattern: "{data_root}/{sample}.consensus.fasta"

# the naming convention for the variants file, NF illumina runs typically use
# "{data_root}/{sample}.variants.tsv" and oxford nanopore runs use "{data_root}/{sample}.pass.vcf.gz"
variants_pattern: "{data_root}/{sample}.variants.tsv"
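To illustrate how these patterns expand, here is a rough sketch using Python's `str.format`. It is not the pipeline's own code: `resolve_path` and `TYPICAL_VARIANTS_PATTERN` are hypothetical names, and the per-platform fallback simply mirrors the "typically use" patterns quoted in the comment above.

```python
# Hypothetical sketch of wildcard expansion; the pipeline itself relies on
# Snakemake for this. Fallback patterns are copied from the config comments.
TYPICAL_VARIANTS_PATTERN = {
    "illumina": "{data_root}/{sample}.variants.tsv",
    "oxford-nanopore": "{data_root}/{sample}.pass.vcf.gz",
}


def resolve_path(config, key, sample):
    """Expand the pattern stored under `key` for one sample."""
    pattern = config.get(key)
    if pattern is None and key == "variants_pattern":
        # No explicit pattern: fall back to the typical one for the platform.
        pattern = TYPICAL_VARIANTS_PATTERN[config["platform"]]
    if pattern is None:
        raise KeyError(f"no pattern configured for {key!r}")
    return pattern.format(data_root=config["data_root"], sample=sample)


config = {
    "data_root": "run_200430",
    "platform": "oxford-nanopore",
    "bam_pattern": "{data_root}/{sample}.sorted.bam",
}

print(resolve_path(config, "bam_pattern", "sampleA"))
# prints run_200430/sampleA.sorted.bam
print(resolve_path(config, "variants_pattern", "sampleA"))
# prints run_200430/sampleA.pass.vcf.gz
```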

Additional metadata, such as qPCR Ct values or the sample collection date, can be included in the summary QC file. Provide the metadata as a separate file and reference its full path in the config.yaml:

metadata: "/path/to/metadata.tsv"

The metadata file must be tab-separated and include the following fields:

sample   ct     date
sampleA  20.8   2020-05-01
sampleB  27.1   2020-06-02
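
A metadata file can be checked for the required columns before it is referenced in the config. The sketch below uses only Python's standard `csv` module; `read_metadata` and `REQUIRED_FIELDS` are hypothetical names, and the in-memory example stands in for a real file at the configured path.

```python
import csv
import io

# Column names taken from the example metadata table above.
REQUIRED_FIELDS = ("sample", "ct", "date")


def read_metadata(handle):
    """Read tab-separated metadata and index the rows by sample name."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [field for field in REQUIRED_FIELDS if field not in header]
    if missing:
        raise ValueError(f"metadata is missing required fields: {missing}")
    return {row["sample"]: row for row in reader}


# In-memory stand-in for a metadata.tsv file; columns are separated by tabs.
example = (
    "sample\tct\tdate\n"
    "sampleA\t20.8\t2020-05-01\n"
    "sampleB\t27.1\t2020-06-02\n"
)

metadata = read_metadata(io.StringIO(example))
print(metadata["sampleA"]["ct"])  # prints 20.8 (values are read as strings)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.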