Question: Finding driver genes from a cancer dataset
2
gravatar for banerjeeshayantan
20 months ago by
banerjeeshayantan140 wrote:

I have 10 normal-tumor matched samples of pancreatic cancer. I have generated the vcfs by comparing the tumor-normal samples. Now I have annotated the vcfs to know which variants fall inside which gene. I am now comparing the gene names from the annotated vcfs with the driver gene database to find how many driver genes are present in my samples. After finding all such driver genes from my 10 samples, I can do some set operations to find out which are the commonly mutated genes . I know it's a very naive way to go about it and there are tools in place to find these genes. But is my approach correct?

sequencing assembly • 832 views
ADD COMMENTlink modified 20 months ago by Kevin Blighe50k • written 20 months ago by banerjeeshayantan140
2

It's not good enough to simply have a somatic mutation in a driver gene - the mutation has to modify the function of the gene. Good to remember this. There are tools that can gauge the likelihood of a somatic mutation doing this, such as GWAVA and FunSeq2: A: pathogenicity predictors of cancer mutations

Also, a driver gene will most likely have been called a high frequency in your samples. This is due to the fact that drivers, being drivers, are found in dominant tumour clones. Mutations called at lower frequencies most likely represent lesser tumour clones that will 'lose out' in the cancer evolution model.

You should also compare your genes and mutations against the chromosomal instability (CIN) gene signatures, such as CIN4, CIN25, and CIN70:

Which gene database specifically are you using?

ADD REPLYlink modified 20 months ago • written 20 months ago by Kevin Blighe50k

I have 10 normal/tumor pairs of mice dataset suffering from PDAC. So it was quite difficult finding putative drivers for mice tumors. But I came across this database (http://ccgd-starrlab.oit.umn.edu/about.php) that has cancer-wise mice drivers. Also one of my observations was that some of the most common driver genes that we expect in pancreatic cancer(like KRAS,TP53 etc) were absent when I did my above analysis. Any thoughts? (P.S: This is just an exploratory analysis to find out (in a very rough sense), what all genes that have variants in my samples, are actually drivers).

ADD REPLYlink written 20 months ago by banerjeeshayantan140

You literally found no mutations in KRAS or TP53? - with a low sample number, I guess that it's possible. What about BRAF?

What I recommend that you do is find the mutations with the highest frequency across all 10 samples. This can be as easy as merging your somatic VCFs together using BCFtools and then using one of my scripts here:

Prioritising them by frequency may make your job a little easier.

ADD REPLYlink written 20 months ago by Kevin Blighe50k

Thanks for your reply. As you were asking, I tried Braf too, with no luck. So after merging I will get one single multi sampled vcf file. After this you suggested running your code, and I will get the frequencies. Now I have to annotate again and find which genes does these high frequency mutations fall into, right?

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Yes, you can merge with BCFtools as follows (best to normalise each first):

bcftools norm -Ov -m-any SomaticVCF1.vcf > SomaticVCF1.norm.vcf
bcftools norm -Ov -m-any SomaticVCF2.vcf > SomaticVCF2.norm.vcf
...

bcftools merge -Ob -m none SomaticVCF1.norm.vcf SomaticVCF2.norm.vcf ... > SomaticMerge.vcf

As you only have 10 samples, you can just do this as a batch-like process (as opposed to a loop).

How did you annotate these? - Variant Effect Predictor? In that case, you can modify my AWK script on the other threads in order to output in addition the annotated column from the VCF

ADD REPLYlink written 20 months ago by Kevin Blighe50k

I used SnpEff to annotate these. Can I generate the annotated vcfs of all the samples and then merge them?Does that help?

ADD REPLYlink modified 20 months ago • written 20 months ago by banerjeeshayantan140

You should be able to do that, yes. SnpEff places the annotation within the VCF itself, right? It has been years since I used Snpff

ADD REPLYlink written 20 months ago by Kevin Blighe50k
1

Yeah it does exactly that. Many thanks for your help. Highly appreciated.

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Hi Kevin, sorry for bothering you once again. I first generated all the annotated files separately and then merged them. Following are the problems I faced, can you please help? 1. The total number of mutations in the merged vcf doesn't add up to the individual mutations. I have a deficit of 736 mutations in the final merged file. 2. After I used your code to find the frequency, I get a maximum frequency of 8. Ranges:1-8. Isn't this too low? Any suggestions?

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Can you paste the commands that you are using?

ADD REPLYlink written 20 months ago by Kevin Blighe50k

Sorry for the late reply. Here goes:

bcftools merge -Ob -m none mut103_anno.vcf.gz mut109_anno.vcf.gz mut137_anno.vcf.gz mut149_anno.vcf.gz mut157_anno.vcf.gz mut167_anno.vcf.gz mut179_anno.vcf.gz mut191_anno.vcf.gz mut197_anno.vcf.gz mut227_anno.vcf.gz > merge.vcf  
And then:  
bcftools view merge.vcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tHetCount"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"")}' > snp.txt  
This is a part of the snp.txt file:(Notice the very less frequency per mutation)  
 V1      V2 V3  V4  V5 V6  
1   CHR     POS ID REF ALT  HetCount  
2 chr10 3750835  .   G   T  2  
3 chr10 3786973  .   T   A  2  
4 chr10 4540351  .   G   T  2  
5 chr10 4734865  .   T   G  2  
6 chr10 5068992  .   T   G  2
ADD REPLYlink modified 20 months ago by Kevin Blighe50k • written 20 months ago by banerjeeshayantan140

Oh, can you try this instead?

bcftools merge -Ob -m-any mut103_anno.vcf.gz mut109_anno.vcf.gz mut137_anno.vcf.gz mut149_anno.vcf.gz mut157_anno.vcf.gz mut167_anno.vcf.gz mut179_anno.vcf.gz mut191_anno.vcf.gz mut197_anno.vcf.gz mut227_anno.vcf.gz > merge.bcf  

bcftools view merge.bcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tnHet\tnHomAlt\tnHomRef"} \
    !/^#/ {hetCount=gsub(/0\|1|1\|0|0\/1|1\/0/,""); \
    homCountAlt=gsub(/1\|1|1\/1/,""); \
    homCountRef=gsub(/0\|0|0\/0/,""); \
    print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"hetCount"\t"homCountAlt"\t"homCountRef}') > snp.txt

Also, where has SnpEff annotated the variants?

ADD REPLYlink written 20 months ago by Kevin Blighe50k

SnpEff has annotated the vcfs and created a separate file. Does that answer your question?

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Also I tried the code that you shared and it didn't give much improvement over the previous numbers. For each of the three columns the frequency ranged from 0 to 6,0 to 7 and 0 to 3.

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Numbers less than 10 should be expected because not every somatic mutation will be called in every sample. The new code (above) just gives you information on heterozygous and homozygous calls.

If the highest number was 8, then that relates to 80% of your samples for that particular variant.

ADD REPLYlink written 20 months ago by Kevin Blighe50k

Got it. Just to clarify, your code is calculating the frequency of a variant being called in throughout the merged vcf. So if the highest number of times a variant was repeated (in all my samples) was 8, then how does that relate to 80% of my samples? Does this mean that out of 10 samples, this variant was repeated 8 samples? I didn't quite get the last part. Apologies if I am being repetitive. I am really trying to grasp this concept.

ADD REPLYlink modified 20 months ago • written 20 months ago by banerjeeshayantan140
1

Sorry, this code was originally written for germline variants and it is just counting 0/1 (het), 1/1 (hom), and 0/0 (ref hom) per variant. As this is cancer, the total number of samples in which a particular somatic mutation is called will be the sum of hetCount and homCountAlt. For the majority of your somatic mutations, this sum will not equal 10, i.e., it will not equal the total number of your samples. If, however, hetCount+homCountAlt=8, then the frequency of that particular somatic mutation is:

8/10 * 100 = 80%

Regarding the somatic mutations that are missing after this sequence of commands, can you retry after having used -m-any in bcftools norm?

ADD REPLYlink written 20 months ago by Kevin Blighe50k

Thanks for the explanation. Now it's clear. So what you are asking me to do is first annotate all the vcfs then normalize and finally merge? Is that the correct order?

ADD REPLYlink written 20 months ago by banerjeeshayantan140

Sorry - I was traveling. It may make more sense to normalise, merge, and then annotate the merged file.

Once you then get the somatic mutation frequencies, you will have to find another way to overlap that with the annotation.

I'm just now thinking that there must be another program that can already do all of this...

ADD REPLYlink written 20 months ago by Kevin Blighe50k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1237 users visited in the last hour