Question: Variant Filtration And Surprising Results
2.9 years ago by
ao.carson80 wrote:


I have been working on a comparison of a human endocrine cancer, matched with normal tissue from the same organ. The tissue was dissected by a pathologist, and is considered quite pure. We sequenced to 80x depth in the tumor and 40x in the normal. Somatic variant calling was done with VarScan2 (samtools -q 1 to use only mapped reads, and the high confidence set). Indels and SNPs were filtered and annotated; we are looking at coding, non-synonymous mutations and splice site mutations. In our small cohort of 15 patients, we looked for those mutations that are in more than one tumor sample. Looking at this list, I was pretty surprised: about 1/2 the variants are found in dbSNP 1.3.7 and most in 1.3.1. About 1/4 are in the COSMIC database (although some of those are in dbSNP as well). I pressed on to manual verification in IGV.

The surprises continue: many of the sites called as somatic mutations are in fact germline. There are more reads in the tumor, but oftentimes the variant allele frequency in the normal is similar or the same. Many of the sites are in dbSNP. Many are in low-coverage regions.

I feel like I did what I could to get quality, somatic mutations out of varscan. I followed the entire GATK pipeline to indel-realign, recal, etc. I used only mapped reads in samtools when piping into varscan. And from varscan, we used the high confidence set. Yet everything is such a soft call... a 'somatic' mutation in one patient has some reads in the normal, and other samples have the variant as a polymorphism as well. I tried to use the preparation and variant calling pipeline to solve this problem, yet here I am at the end, feeling like I'm back to square one. Is there something obvious that I'm missing? (I have also used Mutect, Strelka, GATK, SomaticSniper, but the number of false positives seemed to be the least with Varscan2).

Thank you for the feedback, AOC

somatic dbsnp variant • 4.7k views
modified 2.9 years ago by Cyriac Kandoth4.4k • written 2.9 years ago by ao.carson80


I recently ran into a similar problem. I used MuTect to produce a somatic variant callset but noticed that many of the variants seem to come from reads with low mapping quality. After filtering for variant read MQ, the remaining variants seemed to be of high quality. We then went on to validation of the most interesting variants only to discover that many of those variants were due to systematic error in the sequencing technology; the variants which we thought were somatic were present at very low frequency in all samples (tumor and normal). To correct this in the future, we are filtering out variants in our somatic callset that are present in any normal sample or dbSNP. We are also annotating variants in high-GC regions, as the false positives seem to consistently come from those regions.

I would guess that many of your variants are likely due to systematic error as well, and the increased frequency of those variants in the tumor sample is due to chance rather than anything of biological significance. Probably the best and easiest solution is to filter out any called somatic variants present in any normal samples, although this may be a little drastic depending on what you are looking for.
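The "drop anything seen in a normal or in dbSNP" filter described above could look roughly like the sketch below. It is only an illustration: the function name, the (chrom, pos, ref, alt) tuple representation, and the sample names are all made up for the example, and in practice the sets would be loaded from your own VCFs and a dbSNP release.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt); this representation is an assumption.
Variant = Tuple[str, int, str, str]


def filter_somatic_calls(
    somatic_calls: Iterable[Variant],
    normal_calls_by_sample: Dict[str, Set[Variant]],
    known_germline: Set[Variant],
) -> List[Variant]:
    """Keep calls that are absent from every normal sample and from the known-germline set."""
    seen_in_normals: Set[Variant] = set()
    for variants in normal_calls_by_sample.values():
        seen_in_normals |= variants
    excluded = seen_in_normals | known_germline
    return [v for v in somatic_calls if v not in excluded]


# Toy data, hypothetical sample names and positions.
somatic = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "A")]
normals = {
    "patient01_normal": {("chr2", 2000, "C", "T")},  # also seen in a normal sample
    "patient02_normal": set(),
}
dbsnp = {("chr3", 3000, "G", "A")}  # stands in for a dbSNP lookup

kept = filter_somatic_calls(somatic, normals, dbsnp)
print(kept)
```

Pooling the normals from the whole cohort, rather than only the matched normal, is what catches recurrent systematic artifacts, at the cost of also discarding any true somatic variant that happens to sit at a known polymorphic site.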

written 2.9 years ago by donfreed960
2.9 years ago by
Cyriac Kandoth4.4k
Memorial Sloan Kettering Cancer Center, New York, NY, United States
Cyriac Kandoth4.4k wrote:

In addition to the built-in filters, the VarScan2 paper also recommends using these post processing filters. Do the following in a Terminal window to download the latest version and read the documentation:

curl -LO
perl variant-filter-master/ --help

It mentions that you need to first run bam-readcount on all your variant loci, and pass the result to the script above. And it only works on point mutations for now. WashU has this filtering script for indels, but it is not yet portable like above.

modified 23 months ago • written 2.9 years ago by Cyriac Kandoth4.4k

Is there a portable script like this yet for removing false-positive indels from a VarScan indel VCF file?

written 23 months ago by vchris_ngs2.7k
2.9 years ago by
Charles Warden4.6k
Duarte, CA
Charles Warden4.6k wrote:

You can solve the low coverage problem by being more stringent with the --min-coverage, --min-coverage-normal, and --min-coverage-tumor parameters in VarScan. My guess is that this will make the biggest difference - you can be super conservative with this, if you really want.

If you haven't done so already, you can also set --p-value to 0.05 (not just --somatic-p-value, which is already 0.05 by default).
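If re-running the caller is inconvenient, the same kind of coverage cut can be applied to calls you already have. The sketch below is a post-hoc check on read counts, not VarScan's own implementation of --min-coverage-normal and --min-coverage-tumor; the function name and the thresholds of 20 and 30 are arbitrary values chosen for illustration.

```python
def passes_depth(normal_ref: int, normal_var: int,
                 tumor_ref: int, tumor_var: int,
                 min_normal: int = 20, min_tumor: int = 30) -> bool:
    """Return True when both samples have at least the required total depth at a site."""
    normal_depth = normal_ref + normal_var
    tumor_depth = tumor_ref + tumor_var
    return normal_depth >= min_normal and tumor_depth >= min_tumor


# Hypothetical sites: (normal_ref, normal_var, tumor_ref, tumor_var)
sites = {
    "well_covered": (40, 0, 60, 20),  # normal depth 40, tumor depth 80
    "thin_normal": (10, 0, 60, 20),   # normal depth 10, below the cutoff
}
depth_ok = {name: passes_depth(*counts) for name, counts in sites.items()}
print(depth_ok)
```

A stricter normal cutoff matters most here, because a germline variant is easy to miss when only a handful of reads cover the site in the normal.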

modified 2.9 years ago • written 2.9 years ago by Charles Warden4.6k

Thanks for your response cwarden. I initially used a normal coverage of 6 and tumor coverage of 8. The 'low-coverage' sites I am running into have, for example, 15 reads in the normal, with 2 being variant, and 35 reads in the tumor, with 5 being variant. I don't understand how that is coming up as a high-confidence somatic mutation. It looks germline, or at least not as 'pure' as I thought my calls would be. I can try re-running it with the --p-value set lower than 0.99. The way I read the manual, I thought the somatic-p-value was the second step and would filter out those bad calls that made it past the 0.99 p-value. Thank you for your help.

written 2.9 years ago by ao.carson80

You want the p-values to be lower rather than higher (so, 0.01 not 0.99).

Also, at the risk of asking the obvious, are you specifically looking at the "Somatic" and "LOH" variants? The output file also contains "Germline" variants.

I believe simply setting --p-value to 0.05 has been sufficient to give users good results in the past. Not sure why you are encountering problems.
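As a sanity check on the example site quoted above (2 variant reads out of 15 in the normal, 5 out of 35 in the tumor), a one-sided Fisher's exact test on the read counts can be worked out by hand. This is only a sketch: it assumes the somatic p-value behaves like a one-sided Fisher's exact test for tumor enrichment, and it is a plain re-implementation rather than VarScan's code.

```python
from math import comb


def fisher_one_sided(tumor_var: int, tumor_ref: int,
                     normal_var: int, normal_ref: int) -> float:
    """P(at least `tumor_var` variant reads land in the tumor), all margins fixed."""
    total_var = tumor_var + normal_var
    tumor_total = tumor_var + tumor_ref
    n = tumor_total + normal_var + normal_ref
    max_k = min(total_var, tumor_total)
    # Sum hypergeometric numerators for every table at least as extreme as observed.
    numerator = sum(
        comb(total_var, k) * comb(n - total_var, tumor_total - k)
        for k in range(tumor_var, max_k + 1)
    )
    return numerator / comb(n, tumor_total)


# Counts from the example site: normal 13 ref / 2 var, tumor 30 ref / 5 var.
p = fisher_one_sided(tumor_var=5, tumor_ref=30, normal_var=2, normal_ref=13)
print(f"normal VAF = {2 / 15:.3f}, tumor VAF = {5 / 35:.3f}, one-sided p = {p:.3f}")
```

The two allele frequencies are nearly identical (about 13% and 14%) and the p-value comes out around 0.65, so under this assumption the site should not pass a 0.05 somatic threshold. If such a site shows up among your high-confidence somatic calls, it is worth checking which status column and which file you are reading from.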

modified 2.9 years ago • written 2.9 years ago by Charles Warden4.6k