I have 10 normal-tumor matched samples of pancreatic cancer. I have generated the vcfs by comparing the tumor-normal samples. Now I have annotated the vcfs to know which variants fall inside which gene. I am now comparing the gene names from the annotated vcfs with the driver gene database to find how many driver genes are present in my samples. After finding all such driver genes from my 10 samples, I can do some set operations to find out which are the commonly mutated genes . I know it's a very naive way to go about it and there are tools in place to find these genes. But is my approach correct?
I have 10 normal/tumor pairs of mice dataset suffering from PDAC. So it was quite difficult finding putative drivers for mice tumors. But I came across this database (http://ccgd-starrlab.oit.umn.edu/about.php) that has cancer-wise mice drivers. Also one of my observations was that some of the most common driver genes that we expect in pancreatic cancer(like KRAS,TP53 etc) were absent when I did my above analysis. Any thoughts? (P.S: This is just an exploratory analysis to find out (in a very rough sense), what all genes that have variants in my samples, are actually drivers).
You literally found no mutations in KRAS or TP53? - with a low sample number, I guess that it's possible. What about BRAF?
What I recommend that you do is find the mutations with the highest frequency across all 10 samples. This can be as easy as merging your somatic VCFs together using BCFtools and then using one of my scripts here:
Prioritising them by frequency may make your job a little easier.
Thanks for your reply. As you were asking, I tried Braf too, with no luck. So after merging I will get one single multi sampled vcf file. After this you suggested running your code, and I will get the frequencies. Now I have to annotate again and find which genes does these high frequency mutations fall into, right?
Yes, you can merge with BCFtools as follows (best to normalise each first):
As you only have 10 samples, you can just do this as a batch-like process (as opposed to a loop).
How did you annotate these? - Variant Effect Predictor? In that case, you can modify my AWK script on the other threads in order to output in addition the annotated column from the VCF
I used SnpEff to annotate these. Can I generate the annotated vcfs of all the samples and then merge them?Does that help?
You should be able to do that, yes. SnpEff places the annotation within the VCF itself, right? It has been years since I used Snpff
Yeah it does exactly that. Many thanks for your help. Highly appreciated.
Hi Kevin, sorry for bothering you once again. I first generated all the annotated files separately and then merged them. Following are the problems I faced, can you please help? 1. The total number of mutations in the merged vcf doesn't add up to the individual mutations. I have a deficit of 736 mutations in the final merged file. 2. After I used your code to find the frequency, I get a maximum frequency of 8. Ranges:1-8. Isn't this too low? Any suggestions?
Can you paste the commands that you are using?
Sorry for the late reply. Here goes:
Oh, can you try this instead?
Also, where has SnpEff annotated the variants?
SnpEff has annotated the vcfs and created a separate file. Does that answer your question?
Also I tried the code that you shared and it didn't give much improvement over the previous numbers. For each of the three columns the frequency ranged from 0 to 6,0 to 7 and 0 to 3.
Numbers less than 10 should be expected because not every somatic mutation will be called in every sample. The new code (above) just gives you information on heterozygous and homozygous calls.
If the highest number was 8, then that relates to 80% of your samples for that particular variant.
Got it. Just to clarify, your code is calculating the frequency of a variant being called in throughout the merged vcf. So if the highest number of times a variant was repeated (in all my samples) was 8, then how does that relate to 80% of my samples? Does this mean that out of 10 samples, this variant was repeated 8 samples? I didn't quite get the last part. Apologies if I am being repetitive. I am really trying to grasp this concept.
Sorry, this code was originally written for germline variants and it is just counting 0/1 (het), 1/1 (hom), and 0/0 (ref hom) per variant. As this is cancer, the total number of samples in which a particular somatic mutation is called will be the sum of hetCount and homCountAlt. For the majority of your somatic mutations, this sum will not equal 10, i.e., it will not equal the total number of your samples. If, however, hetCount+homCountAlt=8, then the frequency of that particular somatic mutation is:
Regarding the somatic mutations that are missing after this sequence of commands, can you retry after having used
-m-any
inbcftools norm
?Thanks for the explanation. Now it's clear. So what you are asking me to do is first annotate all the vcfs then normalize and finally merge? Is that the correct order?
Sorry - I was traveling. It may make more sense to normalise, merge, and then annotate the merged file.
Once you then get the somatic mutation frequencies, you will have to find another way to overlap that with the annotation.
I'm just now thinking that there must be another program that can already do all of this...