Variant Filtration And Surprising Results
7.8 years ago
ao.carson ▴ 90

Greetings,

I have been working on a comparison of a human endocrine cancer, matched with normal tissue from the same organ. The tissue was dissected by a pathologist and is considered quite pure. We sequenced to 80x depth in the tumor and 40x in the normal. Somatic variant calling was done with VarScan2 (samtools -q 1 to use only mapped reads, and the high confidence set). Indels and SNPs were filtered and annotated; we are looking at coding, non-synonymous mutations and splice site mutations. In our small cohort of 15 patients, we looked for mutations that occur in more than one tumor sample. The resulting list surprised me: about 1/2 of the variants are found in dbSNP 1.3.7 and most in 1.3.1. About 1/4 are in the COSMIC database (although some of those are in dbSNP as well). I pressed on to manual verification in IGV.

The surprises continue: many of the sites called as somatic mutations are in fact germline. The tumor has more reads, but the variant allele frequency in the normal is often similar or identical. Many of the sites are in dbSNP. Many are in low-coverage regions.

I feel like I did what I could to get quality somatic mutations out of VarScan. I followed the entire GATK pipeline to indel-realign, recalibrate, etc. I used only mapped reads in samtools when piping into VarScan, and from VarScan we used the high confidence set. Yet everything is such a soft call: a 'somatic' mutation in one patient has some reads in the normal, and other samples have the same variant as a polymorphism. I tried to use the preparation and variant calling pipeline to solve this problem, yet here I am at the end, feeling like I'm back to square one. Is there something obvious that I'm missing? (I have also used MuTect, Strelka, GATK, and SomaticSniper, but the number of false positives seemed to be lowest with VarScan2.)

Thank you for the feedback, AOC

dbsnp somatic variant • 7.9k views
0

AOC,

I recently ran into a similar problem. I used MuTect to produce a somatic variant callset but noticed that many of the variants seem to come from reads with low mapping quality. After filtering for variant read MQ, the remaining variants seemed to be of high quality. We then went on to validation of the most interesting variants only to discover that many of those variants were due to systematic error in the sequencing technology; the variants which we thought were somatic were present at very low frequency in all samples (tumor and normal). To correct this in the future, we are filtering out variants in our somatic callset that are present in any normal sample or dbSNP. We are also annotating variants in high-GC regions, as the false positives seem to consistently come from those regions.

I would guess that many of your variants are likely due to systematic error as well, and the increased frequency of those variants in the tumor sample is due to chance rather than anything of biological significance. Probably the best and easiest solution is to filter out any called somatic variants present in any normal samples, although this may be a little drastic depending on what you are looking for.
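The "present in any normal sample" filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `(chrom, pos, ref, alt)` tuple layout, the `min_alt_reads` cutoff of 2, and the example data are all hypothetical, and the variant-read counts in the normals are assumed to have been collected beforehand (for example from a read-count tool).

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt)
Variant = Tuple[str, int, str, str]


def variants_seen_in_normals(
    normal_alt_reads: Dict[str, Dict[Variant, int]],
    min_alt_reads: int = 2,  # hypothetical cutoff, tune for your data
) -> Set[Variant]:
    """Collect variants supported by enough reads in ANY normal sample."""
    seen: Set[Variant] = set()
    for counts in normal_alt_reads.values():
        for variant, alt_reads in counts.items():
            if alt_reads >= min_alt_reads:
                seen.add(variant)
    return seen


def drop_recurrent_normal_variants(
    calls: Iterable[Variant],
    seen_in_normals: Set[Variant],
    dbsnp_sites: FrozenSet[Variant] = frozenset(),
) -> List[Variant]:
    """Keep only calls absent from every normal and from the dbSNP set."""
    return [v for v in calls if v not in seen_in_normals and v not in dbsnp_sites]


# Made-up example data: alt-read counts per normal sample.
normals = {
    "patient01_normal": {("chr1", 100, "A", "G"): 2},
    "patient02_normal": {("chr2", 200, "C", "T"): 0},
}
tumor_calls = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
blacklist = variants_seen_in_normals(normals)
print(drop_recurrent_normal_variants(tumor_calls, blacklist))
```

As noted, this is a blunt filter: a true recurrent somatic hotspot with a few contaminating reads in one normal would be dropped too, so the read cutoff matters.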

2
7.8 years ago

curl -LO https://github.com/ckandoth/variant-filter/archive/master.zip
unzip master.zip
perl variant-filter-master/fpfilter.pl --help

It mentions that you need to first run bam-readcount on all your variant loci, and pass the result to the script above. And it only works on point mutations for now. WashU has this filtering script for indels, but it is not yet portable like fpfilter.pl above.

0

Is there a portable script like fpfilter.pl yet for removing false-positive indels from a VarScan indel VCF file?

1
7.8 years ago

You can solve the low coverage problem by being more stringent with the --min-coverage, --min-coverage-normal, and --min-coverage-tumor parameters in VarScan. My guess is that this will make the biggest difference - you can be super conservative with this, if you really want.

If you haven't done so already, you can also set --p-value to 0.05. Only --somatic-p-value is 0.05 by default; --p-value defaults to 0.99.

http://varscan.sourceforge.net/using-varscan.html#v2.3_somatic
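To make the suggestion above concrete, here is a small sketch that assembles a `VarScan somatic` command line with stricter coverage and p-value settings. The flags are the ones named above; the jar path, the pileup file names, the output prefix, and the threshold values (15 and 20 reads) are placeholders I made up, not recommendations. The function only builds the argument list and does not run anything.

```python
from typing import List


def varscan_somatic_command(
    normal_pileup: str,
    tumor_pileup: str,
    output_prefix: str,
    min_cov_normal: int = 15,  # hypothetical threshold
    min_cov_tumor: int = 20,   # hypothetical threshold
    p_value: float = 0.05,
    jar: str = "VarScan.jar",  # assumed location of the VarScan jar
) -> List[str]:
    """Build (but do not execute) a VarScan somatic command line."""
    return [
        "java", "-jar", jar, "somatic",
        normal_pileup, tumor_pileup, output_prefix,
        "--min-coverage-normal", str(min_cov_normal),
        "--min-coverage-tumor", str(min_cov_tumor),
        "--p-value", str(p_value),
    ]


# File names below are placeholders.
cmd = varscan_somatic_command("normal.pileup", "tumor.pileup", "patient01")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.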

0

Thanks for your response, cwarden. I initially used a normal coverage of 6 and a tumor coverage of 8. The 'low-coverage' sites I am running into have, for example, 15 reads in the normal with 2 being variant, and 35 reads in the tumor with 5 being variant. I don't understand how that is coming up as a high-confidence somatic mutation. It looks germline, or at least not as 'pure' as I thought my calls would be. I can try re-running it with --p-value set lower than 0.99. The way I read the manual, I thought the somatic-p-value was the second step and would filter out the bad calls that made it past the 0.99 p-value. Thank you for your help.
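The read counts in that example can be checked by hand. The sketch below assumes the somatic p-value behaves like a one-sided Fisher's exact test on reference/variant read counts in normal versus tumor; that is my assumption about the statistic, and the function name is made up. It uses only the standard library.

```python
from math import comb


def somatic_p_value(normal_ref: int, normal_var: int,
                    tumor_ref: int, tumor_var: int) -> float:
    """One-sided Fisher's exact test: are variant reads enriched in the tumor?

    Sums the hypergeometric probabilities of seeing at least `tumor_var`
    variant reads in the tumor, given the fixed row and column totals.
    """
    total = normal_ref + normal_var + tumor_ref + tumor_var
    total_var = normal_var + tumor_var
    tumor_depth = tumor_ref + tumor_var
    denom = comb(total, tumor_depth)
    p = 0.0
    for k in range(tumor_var, min(total_var, tumor_depth) + 1):
        p += comb(total_var, k) * comb(total - total_var, tumor_depth - k) / denom
    return min(p, 1.0)  # guard against floating-point overshoot


# 2 of 15 reads variant in the normal, 5 of 35 in the tumor.
print(somatic_p_value(normal_ref=13, normal_var=2, tumor_ref=30, tumor_var=5))
```

With variant allele frequencies of roughly 13% in the normal and 14% in the tumor, the p-value comes out well above 0.05, so a site like this should not pass as somatic on that test alone. That suggests checking which status column and which p-value the 'high confidence' label is actually based on.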

1

You want the p-values to be lower rather than higher (so, 0.01 not 0.99).

Also, at the risk of asking the obvious, are you specifically looking at the "Somatic" and "LOH" variants? The output file also contains "Germline" variants.
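A quick way to check this is to split the output by status before looking at anything else. The sketch below assumes a tab-delimited file with a header row that includes a `somatic_status` column holding values such as "Germline", "Somatic", and "LOH"; the column name and the example rows are assumptions on my part, so adjust them to match your file.

```python
import csv
from typing import Dict, Iterable, Iterator, Tuple


def select_by_status(
    lines: Iterable[str],
    keep: Tuple[str, ...] = ("Somatic", "LOH"),
    status_column: str = "somatic_status",  # assumed header name
) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose status column is in `keep`."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row[status_column] in keep:
            yield row


# Made-up example rows; a real file has many more columns.
example = [
    "chrom\tposition\tref\tvar\tsomatic_status",
    "chr1\t100\tA\tG\tGermline",
    "chr1\t200\tC\tT\tSomatic",
    "chr2\t300\tG\tA\tLOH",
]
for row in select_by_status(example):
    print(row["chrom"], row["position"], row["somatic_status"])
```

For a real file, pass an open file handle instead of the list, e.g. `select_by_status(open(path))`.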

I believe simply setting --p-value to 0.05 has been sufficient to give users good results in the past. Not sure why you are encountering problems.