-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strange difference in overlapping positions between make_prg versions #55
Comments
Thanks for the bug description, it is complete, I don't think I need any more info. Retelling the story in other wordsThis is just for a better understanding. I can see local genotyping was used here, so in the first case:
What I think happened is:
For the 0.4.0 version:
Is the issue on make_prg or pandora?I actually think the issue is on Does genotyping only the differences solves this and similar cases?For this particular case, yes, I think genotyping only the differences would solve it. But with local genotyping, we could easily make a similar example where we would still have an issue. Let's say that There is a bug on the gaps computationIf we look at this record:
The gaps for the first allele is 0.5, which I think is impossible given that we just have 1 base here. We should fix this. Thoughts on local genotypingI am not sure about the state of local genotyping. When we saw that global genotyping was better for E. coli, we deprecated local genotyping. We might have bugs we need to solve, or not... we need to revisit. DiscussionGenotyping is a crucial step in |
|
I would argue this is also a make_prg issue. Shouldn't all three positions be a bubble as they have overlapping positions?
But they do because the GAPS influence the likelihood, hence why 513G>T (v0.3.0) seems a better choice than 511GGC>GGT (v0.4.0).
But the coverage on a base is made up for the mean of minimizers no? As such, it makes total sense as with a kmer size of 15, the coverage on this position is likely influenced by multiple minimzers, with some of those likely coming from the SNP upstream at position 513, thus creating a hole in coverage. I really don't think pandora's genotyping is too bad here, I just think the PRG is confusing. |
I'm seeing these annoying duplications of positions a bit lately. And they're causing me to miss calls in drprg. A recent example I have. Here is an example of the graph (in VCF format)
This position covers the S315T mutation (allele 3 is most common, but allele 2 is also S315T) in katG, which is SUPER important and common. I have an example where accession ERR2516571 has this S315T mutation, plus a novel mutation I317V. As this is close, when the new mutation is added, we would expect it to be included in the position above. This does happen, but something weird also happens
There is a duplication of the position 1044 with a SNP for some weird reason. That position has no depth, and gets genotyped as null. But the "correct" 1044 position gets genotyped as ref (assumably to make genotypes compatible?). We can see that pandora gets depth on the correct allele (allele 3), but doesn't call that genotype because of the confusing duplicate position. Here is what de novo finds
@leoisl do you know why make_prg is making these weird duplicated positions? |
I don't think this is a First thing to note is that the PRG string corresponds to the PRG GFA, so no bugs there. I also excluded the flanking nodes of this site in the GFA, but they are basically flanked by two long nodes. If we take a look at the MSAs, we have 5 different strings (sorry that So my conclusion is that the PRG string corresponds to the PRG GFA which corresponds to the MSA, i.e. the PRG is correctly representing the information we have in the MSA. Thus I don't think this is a |
The PRG string is technically representing this correctly, but my issue is I think the representation is somewhat flawed. With a minimum match length of 5, we should basically have a single site with 5 different alleles for this no? I don't get why the 1044 G A site is separate, it just complicates a relatively simple scenario. |
I agree, although the graph is correct, I think I found a bug in
From my debugging, this should not be the reason. When we reach this point in the recursion, with a minimum match length of 5 or even 6, we will want to cluster these sequences and recursively make prg on the subclusters because: 1) all sequences are large enough to cluster: make_prg/make_prg/from_msa/cluster_sequences.py Lines 229 to 231 in c2c7ff0
2) we have 5 different sequences to cluster, not two or less that we can skip clustering:
In order to cluster then we get all distinct kmers, describe each sequence as a count matrix of these distinct kmers and call kmeans clustering, etc. With minimum match length of 6, we will have 5 different 6-mers, each sequence being one different 6-mer, and they will all be clustered into singleton clusters, which will make us list all 5 alleles directly, as you'd want the description to be. With minimum match length of 5, however, the situation is a bit more complicated, as the alleles share kmers and then we might get two or more clusters. However, there is a small bug with the function that decides if we should cluster or not: make_prg/make_prg/from_msa/cluster_sequences.py Lines 107 to 111 in c2c7ff0
Basically, this function check if all clusters are one-ref like, i.e. if we can get a consensus sequence
With
However, I think that, in general, whenever we are working with a (sub-)MSA, we can safely deduplicate the alleles. This would not only solve this issue, but also any other future issue with duplicated alleles, and will also result in a much better performance, as the gene MSAs in practice have very similar regions. I have just implemented this in this branch (see 71b90f3), and tested on this |
Thanks for the detailed response. All of this makes sense. Deduplicating the alleles does seem like the better way to approach things I think. If you tested it out on that katG example then I think that's fine. I'll have a dig through my results and see if I can find any other cases of this and test out your branch. Thanks a lot Leandro! |
Hmmm, so this has popped it's head up a number of times now in pncA and rpoB and gid. Any idea when there might be a binary available with your fix @leoisl so I can test it out? |
Hey @mbhall88 , you can download the binary here: https://www.dropbox.com/s/nl5q3zvmoaj86is/make_prg_0.4.1?dl=1 . Note that this is only the binary with my changes in this branch to try to fix this issue. I haven't had time to make unit tests and make this a stable version and properly release it, so be aware these changes are still untested... but I guess you will get a feeling if this is working given that you have examples where this issue happens and you can do a sort of regression test looking at the precision-recall of drprg using this new make-prg version. Sorry for not being able to release this, all my time is being devoted to roundhound as we have some 3 deadlines coming up, with the earliest one being in 7 days, and there are still many things to do... OFC I can reprioritise things WRT what @iqbal-lab thinks should be the priority |
I've tested that out on two samples and I still get the same problem. Here is one of the examples that has a POS duplication at rpoB POS 1389 in
|
I have recently tested drprg out after upgrading make_prg from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 to the latest v0.4.0
The results were almost the same across ~8500 samples. However, there were 3 that have gone from being TPs to FNs.
It has to do with a region in the gene gid where the positions have weird overlaps that differ between make_prg versions. This is all a bit easier if I put in some examples
Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)
However, this is the same region with the latest v0.4.0 make_prg
you can see the POS 511 variant is filtered due to FRS - there is coverage on the ref and alt, where the same variant (POS 513) in the previous version has no coverage on the ref.
The second variant in those examples overlaps both the first and third variants.
This 513C>T variant is discovered by denovo in both cases - i.e. the reference PRG is effectively just the first two variants
In both cases I am using pandora v0.10.0-alpha.0
The make_prg command is:
For the v0.3.0 version (you can find all the drprg/pandora/make_prg files at
/hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499
)and for the v0.4.0 version (all files are at
/hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/
)Let me know if you need any other info @leoisl
The text was updated successfully, but these errors were encountered: