Question

Finding driver genes from a cancer dataset

2

Entering edit mode

6.2 years ago

Gene_MMP8 ▴ 240

I have 10 normal-tumor matched samples of pancreatic cancer. I have generated the vcfs by comparing the tumor-normal samples. Now I have annotated the vcfs to know which variants fall inside which gene. I am now comparing the gene names from the annotated vcfs with the driver gene database to find how many driver genes are present in my samples. After finding all such driver genes from my 10 samples, I can do some set operations to find out which are the commonly mutated genes . I know it's a very naive way to go about it and there are tools in place to find these genes. But is my approach correct?

Assembly sequencing • 2.1k views

ADD COMMENT • link updated 6.2 years ago by Kevin Blighe 87k • written 6.2 years ago by Gene_MMP8 ▴ 240

Kevin Blighe · Answer 1 · 2018-02-21

2

Entering edit mode

6.2 years ago

Kevin Blighe 87k

It's not good enough to simply have a somatic mutation in a driver gene - the mutation has to modify the function of the gene. Good to remember this. There are tools that can gauge the likelihood of a somatic mutation doing this, such as GWAVA and FunSeq2: A: pathogenicity predictors of cancer mutations

Also, a driver gene will most likely have been called a high frequency in your samples. This is due to the fact that drivers, being drivers, are found in dominant tumour clones. Mutations called at lower frequencies most likely represent lesser tumour clones that will 'lose out' in the cancer evolution model.

You should also compare your genes and mutations against the chromosomal instability (CIN) gene signatures, such as CIN4, CIN25, and CIN70:

Which gene database specifically are you using?

ADD COMMENT • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

I have 10 normal/tumor pairs of mice dataset suffering from PDAC. So it was quite difficult finding putative drivers for mice tumors. But I came across this database (http://ccgd-starrlab.oit.umn.edu/about.php) that has cancer-wise mice drivers. Also one of my observations was that some of the most common driver genes that we expect in pancreatic cancer(like KRAS,TP53 etc) were absent when I did my above analysis. Any thoughts? (P.S: This is just an exploratory analysis to find out (in a very rough sense), what all genes that have variants in my samples, are actually drivers).

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

You literally found no mutations in KRAS or TP53? - with a low sample number, I guess that it's possible. What about BRAF?

What I recommend that you do is find the mutations with the highest frequency across all 10 samples. This can be as easy as merging your somatic VCFs together using BCFtools and then using one of my scripts here:

Prioritising them by frequency may make your job a little easier.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks for your reply. As you were asking, I tried Braf too, with no luck. So after merging I will get one single multi sampled vcf file. After this you suggested running your code, and I will get the frequencies. Now I have to annotate again and find which genes does these high frequency mutations fall into, right?

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Yes, you can merge with BCFtools as follows (best to normalise each first):

bcftools norm -Ov -m-any SomaticVCF1.vcf > SomaticVCF1.norm.vcf
bcftools norm -Ov -m-any SomaticVCF2.vcf > SomaticVCF2.norm.vcf
...

bcftools merge -Ob -m none SomaticVCF1.norm.vcf SomaticVCF2.norm.vcf ... > SomaticMerge.vcf

As you only have 10 samples, you can just do this as a batch-like process (as opposed to a loop).

How did you annotate these? - Variant Effect Predictor? In that case, you can modify my AWK script on the other threads in order to output in addition the annotated column from the VCF

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

I used SnpEff to annotate these. Can I generate the annotated vcfs of all the samples and then merge them?Does that help?

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

You should be able to do that, yes. SnpEff places the annotation within the VCF itself, right? It has been years since I used Snpff

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

1

Entering edit mode

Yeah it does exactly that. Many thanks for your help. Highly appreciated.

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Hi Kevin, sorry for bothering you once again. I first generated all the annotated files separately and then merged them. Following are the problems I faced, can you please help? 1. The total number of mutations in the merged vcf doesn't add up to the individual mutations. I have a deficit of 736 mutations in the final merged file. 2. After I used your code to find the frequency, I get a maximum frequency of 8. Ranges:1-8. Isn't this too low? Any suggestions?

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Can you paste the commands that you are using?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry for the late reply. Here goes:

bcftools merge -Ob -m none mut103_anno.vcf.gz mut109_anno.vcf.gz mut137_anno.vcf.gz mut149_anno.vcf.gz mut157_anno.vcf.gz mut167_anno.vcf.gz mut179_anno.vcf.gz mut191_anno.vcf.gz mut197_anno.vcf.gz mut227_anno.vcf.gz > merge.vcf  
And then:  
bcftools view merge.vcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tHetCount"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" gsub(/0\|1|1\|0|0\/1|1\/0/,"")}' > snp.txt  
This is a part of the snp.txt file:(Notice the very less frequency per mutation)  
 V1      V2 V3  V4  V5 V6  
1   CHR     POS ID REF ALT  HetCount  
2 chr10 3750835  .   G   T  2  
3 chr10 3786973  .   T   A  2  
4 chr10 4540351  .   G   T  2  
5 chr10 4734865  .   T   G  2  
6 chr10 5068992  .   T   G  2

ADD REPLY • link updated 6.2 years ago by Kevin Blighe 87k • written 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Oh, can you try this instead?

bcftools merge -Ob -m-any mut103_anno.vcf.gz mut109_anno.vcf.gz mut137_anno.vcf.gz mut149_anno.vcf.gz mut157_anno.vcf.gz mut167_anno.vcf.gz mut179_anno.vcf.gz mut191_anno.vcf.gz mut197_anno.vcf.gz mut227_anno.vcf.gz > merge.bcf  

bcftools view merge.bcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tnHet\tnHomAlt\tnHomRef"} \
    !/^#/ {hetCount=gsub(/0\|1|1\|0|0\/1|1\/0/,""); \
    homCountAlt=gsub(/1\|1|1\/1/,""); \
    homCountRef=gsub(/0\|0|0\/0/,""); \
    print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"hetCount"\t"homCountAlt"\t"homCountRef}') > snp.txt

Also, where has SnpEff annotated the variants?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

SnpEff has annotated the vcfs and created a separate file. Does that answer your question?

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Also I tried the code that you shared and it didn't give much improvement over the previous numbers. For each of the three columns the frequency ranged from 0 to 6,0 to 7 and 0 to 3.

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Numbers less than 10 should be expected because not every somatic mutation will be called in every sample. The new code (above) just gives you information on heterozygous and homozygous calls.

If the highest number was 8, then that relates to 80% of your samples for that particular variant.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Got it. Just to clarify, your code is calculating the frequency of a variant being called in throughout the merged vcf. So if the highest number of times a variant was repeated (in all my samples) was 8, then how does that relate to 80% of my samples? Does this mean that out of 10 samples, this variant was repeated 8 samples? I didn't quite get the last part. Apologies if I am being repetitive. I am really trying to grasp this concept.

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

1

Entering edit mode

Sorry, this code was originally written for germline variants and it is just counting 0/1 (het), 1/1 (hom), and 0/0 (ref hom) per variant. As this is cancer, the total number of samples in which a particular somatic mutation is called will be the sum of hetCount and homCountAlt. For the majority of your somatic mutations, this sum will not equal 10, i.e., it will not equal the total number of your samples. If, however, hetCount+homCountAlt=8, then the frequency of that particular somatic mutation is:

8/10 * 100 = 80%

Regarding the somatic mutations that are missing after this sequence of commands, can you retry after having used -m-any in bcftools norm?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks for the explanation. Now it's clear. So what you are asking me to do is first annotate all the vcfs then normalize and finally merge? Is that the correct order?

ADD REPLY • link 6.2 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Sorry - I was traveling. It may make more sense to normalise, merge, and then annotate the merged file.

Once you then get the somatic mutation frequencies, you will have to find another way to overlap that with the annotation.

I'm just now thinking that there must be another program that can already do all of this...

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k