Skip to content

Latest commit

 

History

History
97 lines (70 loc) · 6.15 KB

README.md

File metadata and controls

97 lines (70 loc) · 6.15 KB

Binomial Variant Caller from Pileup

This repo contains an implementation of a simple Variant Calling algorithm that assumes that nucleotide counts at one pileup position follow the same Binomial distribution and uses Maximum Likelihood Estimation (MLE) for calling the variants.

Youtube presentation

Table of content

How to use

The algorithm takes an input file in the Pileup format, and produces a .vcf output file in the Variant Call format.

To run the tool:

  • Install pysam, numpy, scipy. TODO: Create requirements.txt
  • Run python main.py --input-file=PATH_TO_INPUT_FILE --output-file=PATH_TO_SAVE_OUTPUT --p=0.8.

To see other possible parameters, call python main.py --help.

What is Variant Calling?

Variant calling is the process of finding differences between a reference genome and an observed sample.
By comparing an observed genome with a well known reference genome, we can identify, store and query someone's genome easier.
Genomic variants can be simple, one example are single nucleotide variants (SNVs), and they can be complex, one example includes translocations.

Before identifying variants, we need to read the actual genome we want to analyze. Unfortunately, there isn't a method to read the whole genome in one go, so DNA is read by fragmenting the genome into millions of molecules and reading shorter reads instead. The reads are then aligned to reconstruct the whole genome.

The algorithm implemented in this repo identifies variants by looking at only one position at a time, for what the pileup format is perfect for.
Pileup format gives us how many of each nucleotide (A, G, C, T) are at every position, and also how many insertion/deletition occurencies are at that position. It also gives us the information what the reference genome had, against what we will compare.

Algorithm

After we have constrained the analysis to only one sequence position, we want to count how many of each AGCT letter and each INDEL (insertion/deletition) is at that position.
We will assume diploidy, hence there are at most two variants at one position. We will select between all letters and indels two with the largest count, remember their counts and discard others.

To simplify, let's call them letters a and b, with counts na and nb. We will use the Maximum Likelihood (MLE) estimation to select the variants.

P(variant|counts) = P(variant) * P(counts|variant) / P(counts)

MLE ignores P(variant), and chooses the most probable one by maximizing P(counts|variant), the likelihood function.

All nucleotides at one position are either correct given the variant, or an error due to read extraction process. We will assume that all nucleotides are correct with probability p and are independent of each other. There are three cases:

  • Variant is aa, na nucleotides are correct. Likelihood function is given by Binomial distribution with parameters na + nb and p, at position na.
  • Variant is bb, nb nucleotides are correct. Likelihood function is given by Binomial distribution with parameters na + nb and p, at position nb.
  • Variant is ab, all nucleotides are correct. We found that it is suboptimal to blend a and b into one correct letter and have the likelihood function be equal to the same Binomial distribution at position na + nb, because this option would be picked too often. To reduce that, we assumed that a and b should appear with equal probability 1/2 and calculated the likelihood as probability that the counts are na and nb in that setting.

After calculating the likelihoods, we will pick the most likely variant! That variant is compared to the referent genome, to determine the genotype and the 'alts' field to store in the VCF 4.2 format output file.

Results

We used merged-normal.bam from Seven Bridges Cancer Genomics Cloud to test the algorithm. We produced merged-normal.pileup using Bcftools Mpileup tool and compared our called variants with variants produces by Bcftools Call tool.

Having or not having a variant at one position in the Bcftools Call tool output will be our true/false labels, and comparing the output of our algorithm gives us true positives, true negatives, false positives and false negatives, which are used to calculate standard metrics.

We ran the algorithm for probabilities between 0.5 and 1, at every 0.05 increment, and the results are:

The confusion matrix for p = 0.8 (best f1 score) is as follows:

Acknowlegments

This project was done for Computational Genomics course at School of Electrical Engineering, University of Belgrade.

You can check the course github for useful learning materials.

Bcftools Mpileup tool and Bcftools Call tool were ran on Seven Bridges Cancer Genomics Cloud , where we also downloaded the reference human genome and merged-normal.bam test file.

Licence

License: MIT