No genes with qglobal_cv < 0.1 #69

Open
kvn95ss opened this issue Aug 23, 2021 · 7 comments

@kvn95ss

kvn95ss commented Aug 23, 2021

Hello,

I ran dNdScv on filtered output from Mutect2 (tumor vs. normal, a single patient, with a PoN of 4 samples). I built the mutation list by querying the VCF file with bcftools, which gives me the columns sampleID, chr, pos, ref and mut.

I'm using the hg38 reference from the precomputed rda file in this repo - https://github.com/im3sanger/dndscv_data/tree/master/data

I'm able to get dndscv running for my data by using these commands
cancer_test <- read.table("CC028_dmg_test.vcf")
cancer_processed_data = dndscv(cancer_test, ref_db="data/RefCDS_human_GRCh38.p12.rda", cv=NULL)
sel_cv = cancer_processed_data$sel_cv
print(head(sel_cv), digits = 3)
I get this output -

      gene_name n_syn n_mis n_non n_spl n_ind wmis_cv wnon_cv wspl_cv wind_cv
8821   KRTAP5-4     0     1     0     0     2    46.7       0       0     772
16565   TAS2R30     0     2     0     0     1    61.4       0       0     276
7412      HLA-C     0     2     0     0     1    33.4       0       0     237
13331      PSG3     0     2     0     0     1    36.7       0       0     186
9056     LILRA4     0     2     0     0     1    31.2       0       0     177
18255  USP17L18     0     1     2     0     0    13.5     357     357       0
       pmis_cv ptrunc_cv pallsubs_cv  pind_cv qmis_cv qtrunc_cv qallsubs_cv
8821  0.016677  9.61e-01    5.69e-02 7.03e-05   0.897     0.983       0.996
16565 0.000788  9.54e-01    3.55e-03 3.48e-03   0.897     0.983       0.996
7412  0.002592  9.25e-01    1.06e-02 4.03e-03   0.897     0.983       0.996
13331 0.002184  9.08e-01    8.98e-03 5.08e-03   0.897     0.983       0.996
9056  0.002970  9.14e-01    1.19e-02 5.32e-03   0.897     0.983       0.996
18255 0.067056  1.96e-05    6.39e-05 1.00e+00   0.897     0.383       0.996
      pglobal_cv qglobal_cv
8821    5.37e-05          1
16565   1.52e-04          1
7412    4.71e-04          1
13331   5.01e-04          1
9056    6.77e-04          1
18255   6.81e-04          1

But when looking for significant genes, I get no output
print(cancer_processed_data$sel_cv[cancer_processed_data$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])

<0 rows> (or 0-length row.names)

What could be the reason for this? Does this imply there are no significant genes in the data?

@shaghayeghsoudi

Hey, have you been able to find an answer to your question? I am running into the exact same problem and getting no hits. Thanks

@im3sanger
Owner

Hello,

Sorry for the very late reply.

Yes, this means that there are no recurrently mutated genes in your dataset reaching statistical significance. Can you explain your experimental design in more detail? From your earlier description it sounds like you are analysing data from a single patient. Is that correct? In that case it would not be surprising to find no significant recurrence, as detecting it relies on finding mutations in the same gene across multiple samples or patients.

Inigo
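
To illustrate why small p-values can still give q = 1, here is a minimal sketch. It assumes that the q-values are Benjamini-Hochberg adjusted p-values across all tested genes, that about 20,000 genes were tested, and that every gene outside the six shown has p = 1. The gene count and the padding are assumptions for illustration, not values taken from the run.

```r
# Top six pglobal_cv values copied from the sel_cv output earlier in the thread.
p_top <- c(5.37e-05, 1.52e-04, 4.71e-04, 5.01e-04, 6.77e-04, 6.81e-04)

# ASSUMPTION: roughly 20,000 genes were tested, and all other genes have p = 1.
n_genes <- 20000
p_all <- c(p_top, rep(1, n_genes - length(p_top)))

# Benjamini-Hochberg adjustment across all tested genes.
q_all <- p.adjust(p_all, method = "BH")

# The smallest p-value multiplied by the number of tests already exceeds 1,
# so the adjusted value for the top-ranked gene is capped at 1.
print(min(p_top) * n_genes)
print(head(q_all))

# If only one gene stood out, its p-value would need to be below this
# threshold to reach q < 0.1 under the same assumptions.
print(0.1 / n_genes)
```

Under these assumptions, a single gene would need a p-value roughly ten times smaller than the best one observed here before it could pass q < 0.1.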

@shaghayeghsoudi

Hi Inigo, thanks for your reply. I am actually working on 27 WES sarcoma tumours. They are multi-regional: for each tumour I have 3-6 regions sampled and sequenced, which I merge into one sample per tumour by removing duplicate mutations. I was expecting to find at least a few hits, as sarcomas are not normally SSM-type tumours, but all my q-values are equal to one, so nothing is significant.
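
The merging step described above could look roughly like the following sketch. The toy table and the `tumourID` and `region` column names are hypothetical; only the five output columns (sampleID, chr, pos, ref, mut) come from the input format mentioned earlier in the thread.

```r
# Toy multi-region calls; ASSUMPTION: one row per mutation per region,
# with hypothetical tumourID and region columns.
muts <- data.frame(
  tumourID = c("T1", "T1", "T1", "T2"),
  region   = c("R1", "R2", "R1", "R1"),
  chr      = c("1", "1", "2", "1"),
  pos      = c(100, 100, 200, 100),
  ref      = c("A", "A", "C", "A"),
  mut      = c("T", "T", "G", "T"),
  stringsAsFactors = FALSE
)

# Drop the region column and keep each mutation once per tumour.
merged <- unique(muts[, c("tumourID", "chr", "pos", "ref", "mut")])

# Use the tumour as the sample identifier in the merged table.
names(merged)[1] <- "sampleID"
print(merged)
```

The mutation shared by regions R1 and R2 of tumour T1 is kept once, while the same change in tumour T2 stays as a separate row because it belongs to a different sample.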

@im3sanger im3sanger reopened this Oct 21, 2022
@im3sanger
Owner

Hello,

Thank you. Apologies, I had not realised that there were questions from separate users.

Can you confirm what value of theta you are getting? (dndsout$nbreg$theta).

Lack of significance can be caused by datasets that are too small or that do not have sufficient recurrence for any gene to reach significance. However, it is always important to check that your theta value is not very low (<<1). Very low theta values mean that there is very high variation in the density of synonymous mutations across genes. This typically reflects problems with the mutation calls, such as recurrent artefacts or SNP contamination. Large variation in the density of mutations across genes (high overdispersion) makes dNdScv more conservative (a gene needs more mutations to emerge from the noise) and results in less significance.
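
The effect of overdispersion can be illustrated with a plain negative binomial tail probability. This is only a sketch of the general idea, not the test dNdScv actually runs, and the expected and observed counts are made-up values.

```r
mu <- 0.5       # ASSUMPTION: expected mutations in a gene under the background model
observed <- 5   # ASSUMPTION: mutations observed in that gene

# Probability of seeing at least `observed` mutations by chance under a
# negative binomial with mean `mu` and dispersion parameter `theta`.
tail_prob <- function(theta) {
  pnbinom(observed - 1, size = theta, mu = mu, lower.tail = FALSE)
}

p_low_theta  <- tail_prob(0.1)  # strong overdispersion
p_high_theta <- tail_prob(10)   # mild overdispersion, close to Poisson

print(c(theta_0.1 = p_low_theta, theta_10 = p_high_theta))
```

With the lower theta, the same observed count is much less surprising, which is the sense in which a gene needs more mutations to stand out.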

If your dataset has good theta values (>1, or ideally >3) and your mutation calls are reliable, then the lack of significance may reflect insufficient power (small datasets or insufficient recurrence).

Best,
Inigo
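
A small helper along these lines could make the theta check routine. The thresholds come from the comment above; the function name is hypothetical, and since no exact cut-off is given for "very low", the sketch simply reports anything at or below 1 as outside the good range.

```r
# Hypothetical helper: label a theta value using the thresholds quoted above.
describe_theta <- function(theta) {
  stopifnot(is.numeric(theta), length(theta) == 1, theta > 0)
  if (theta > 3) {
    "ideal (theta > 3)"
  } else if (theta > 1) {
    "good (theta > 1)"
  } else {
    "below the good range (theta <= 1): check calls for recurrent artefacts or SNP contamination"
  }
}

print(describe_theta(3.881757))
print(describe_theta(0.2))

# With a real dNdScv result (not run here), the value would be read from
# the output object, for example: describe_theta(dndsout$nbreg$theta)
```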

@ym-chen

ym-chen commented Aug 1, 2024

Hello,

I encountered the same issue. My samples come from multiple tissue sites of several patients. I used MuTect2 to obtain a set of somatic variants. However, when I used dNdScv to look for driver genes, the qglobal_cv for all genes is close to 1. I noticed that the result shows θ=3.881757. I find this quite confusing.

@im3sanger
Owner

Hello ym-chen,

Thanks for your message. Could you clarify how many samples you are analysing here? Lack of significance does not necessarily mean that there are no drivers in your dataset, only that there is not enough evidence to reach statistical significance. This could be due to insufficient power if your dataset is too small.

Best,
Inigo
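
One quick way to gauge sample count and recurrence is to count how many distinct samples carry a mutation in each gene. The sketch below uses a toy table; the column names (sampleID, gene) and the gene labels are assumptions for illustration.

```r
# Toy annotated mutation table; ASSUMPTION: columns sampleID and gene.
annot <- data.frame(
  sampleID = c("P1", "P1", "P2", "P3", "P3", "P3"),
  gene     = c("GENE_A", "GENE_B", "GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  stringsAsFactors = FALSE
)

# Number of independent samples in the dataset.
n_samples <- length(unique(annot$sampleID))

# Count each gene at most once per sample, then tally samples per gene.
per_gene <- unique(annot[, c("sampleID", "gene")])
recurrence <- sort(table(per_gene$gene), decreasing = TRUE)

print(n_samples)
print(recurrence)
```

If few genes are mutated in more than one or two samples, the dataset is unlikely to have enough power for any gene to reach significance.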

@ym-chen

ym-chen commented Dec 17, 2024

@im3sanger Sorry for the late reply. I have 11 individuals, each with about 30 microdissection sites. Each site underwent WES. In my analysis, I treated each individual as one unit. I suspect that I couldn't obtain significant results because I filtered out too many mutations while selecting reliable mutation sites in the preliminary phase, which left the data very sparse.
