Why does GATK generate more SNPs than samtools does?
0
0
Entering edit mode
8.0 years ago
CY ▴ 750

Could anyone explain to me why GATK generates far more SNPs than samtools does? I have heard that GATK is permissive in SNP calling. I just cannot understand why, since GATK has ways to eliminate false-positive SNPs, for example through local realignment.

SNP next-gen sequencing • 1.5k views
ADD COMMENT
0
Entering edit mode

Right, as you say, GATK is a very permissive SNP caller, calling SNPs other callers might not. That is because calling is treated as a completely separate step from the filtering, which happens at the end.

This allows the filtering to take into account the calls from multiple samples/runs at once. Maybe no individual sample alone has enough sequencing depth to call a SNP, but when the data from all the patients are combined it is clearly a SNP, and so it won't be filtered out.

There is also local realignment, BQSR and the probabilistic calling model.
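To make the call-then-filter idea concrete, here is a minimal sketch of a joint-calling workflow, assuming GATK4-style command names and hypothetical file names (ref.fa, sampleN.bam); older GATK releases use different invocations, so treat this as an outline rather than a recipe:

```python
import subprocess

# Hypothetical inputs: adjust the reference and BAM names for your project.
reference = "ref.fa"
samples = ["sample1", "sample2", "sample3"]

# 1) Call each sample independently into a GVCF (no hard filtering decisions yet).
for s in samples:
    subprocess.run(
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", f"{s}.bam",
         "-O", f"{s}.g.vcf.gz", "-ERC", "GVCF"],
        check=True)

# 2) Joint-genotype across all samples, so weak per-sample evidence can be pooled.
subprocess.run(
    ["gatk", "CombineGVCFs", "-R", reference,
     *sum([["-V", f"{s}.g.vcf.gz"] for s in samples], []),
     "-O", "cohort.g.vcf.gz"], check=True)
subprocess.run(
    ["gatk", "GenotypeGVCFs", "-R", reference, "-V", "cohort.g.vcf.gz",
     "-O", "cohort.vcf.gz"], check=True)

# 3) Filtering only happens now, as a separate step on the joint call set
#    (the QD < 2.0 cutoff is just an illustrative example, not a recommendation).
subprocess.run(
    ["gatk", "VariantFiltration", "-V", "cohort.vcf.gz",
     "--filter-expression", "QD < 2.0", "--filter-name", "lowQD",
     "-O", "cohort.filtered.vcf.gz"], check=True)
```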

ADD REPLY
0
Entering edit mode

Let's probably call it a variant rather than a SNP; the term is often not used in its strict sense. As you say, only when you pool samples and see a recurrent variant or point mutation in more than 1% of that population can you properly call it a SNP for that sample population. GATK does have a very different way of calling variants, but there are several ways to get the job done. If you are calling on a single sample, a single caller might throw up a large number of false positives, depending on which criteria are used for filtering the variants. One option is to run GATK/SomaticSniper/VarScan2/MuTect2, overlap the VCFs of all the call sets, and then restrict yourself to the variants seen by all the callers.

What most people do right now, and what I have seen in the literature, is to run all the processing steps with GATK, then use different variant callers and finally overlap the VCFs to find a restricted variant set. That is mostly done when you have very few samples; with a large number of samples the GATK standard practice should work well in your case. Again, if you see too many variants for your single sample, you can always post-filter them down to a restricted set using thresholds on DP, AF, SB and many other quality metrics. For that you have to plot the distributions of those metrics to see above which values you would keep calls after filtering.
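For what it's worth, the overlap step itself is simple to compute. A rough sketch, assuming hypothetical per-caller output files on the same sample and plain-text VCF parsing (no allele normalisation, which a real analysis would need):

```python
import gzip

def variant_keys(vcf_path):
    """Return the set of (CHROM, POS, REF, ALT) keys found in a VCF/VCF.gz file."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    keys = set()
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            # One key per ALT allele, so multi-allelic records can still overlap.
            for a in alt.split(","):
                keys.add((chrom, pos, ref, a))
    return keys

# Hypothetical call sets from different callers on the same sample.
callsets = ["gatk.vcf.gz", "varscan2.vcf", "mutect2.vcf.gz"]
shared = set.intersection(*(variant_keys(p) for p in callsets))
print(f"{len(shared)} variants called by all {len(callsets)} callers")
```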

ADD REPLY
0
Entering edit mode

I don't know about the "try everything and take the minimum set" approach. You see it all over the place for all sorts of things - peaks from peak callers, SNP/variant calls, dysregulated genes, etc. - it's very common, but that's just not how statistics works. It's like saying "We need a single statistic to represent this distribution!" - "I like the mean!", "I like the median!", "I like the mode!" - "OK, let's use the meaodian value, which is all three added together and divided by three." - "But no, we should take the middle value of the three!" - "No, we should take the most common value of the three!", etc. Whatever you choose, your statistic no longer means anything.

Usually it's just a lazy way of saying "There are many ways to do this, so instead of applying some critical thinking, I used all of them.", hehe :)

ADD REPLY
1
Entering edit mode

Ah yes, true, I was not diving into the statistics here. First of all, different tools use different statistical approaches to define a mutation, so each assessment is quite independent. I know the overlap approach sounds a bit out of fashion and not statistically rigorous. What I would actually say is that if I take the most widely used variant callers and post-filter their variants based on the medians of the depth and allele-frequency distributions, I can fish out some top variants for each caller. I can then take the overlap of the different callers, or select the top variants from any one of them while keeping that caller's statistics untouched. My knowledge of the underlying statistics is not that deep, so the approach I suggested might not be of interest; but having said that, variant callers are often overlapped on their top variants to find true positives, even though they use different statistical models to assign a variant. That is all I wanted to suggest. Perhaps @John can shed better light on this for the OP. I did not want the OP to be biased towards GATK only, since GATK never worked well for single samples for me; for pooled samples it gives much better statistics for the variation. Depending on the availability of samples and coverage, the OP can also look at the other tools I mentioned before.
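As an illustration of that median-based post-filter, a rough sketch below keeps only variants at or above the median depth and allele frequency. It assumes DP and AF are present in the INFO field, which is caller-dependent, and the median cutoffs are just an example, not a standard recipe:

```python
import gzip
import statistics

def parse_info(info):
    """Turn a VCF INFO string into a dict of key -> first value."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value.split(",")[0] if value else True
    return out

def read_records(vcf_path):
    """Yield VCF data lines as lists of fields."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if not line.startswith("#"):
                yield line.rstrip("\n").split("\t")

records = list(read_records("caller.vcf.gz"))  # hypothetical single-caller output
dp = [float(parse_info(r[7]).get("DP", 0)) for r in records]
af = [float(parse_info(r[7]).get("AF", 0)) for r in records]

# Use the medians of the observed distributions as the post-filter thresholds.
dp_cut, af_cut = statistics.median(dp), statistics.median(af)
kept = [r for r, d, a in zip(records, dp, af) if d >= dp_cut and a >= af_cut]
print(f"kept {len(kept)} of {len(records)} variants (DP>={dp_cut}, AF>={af_cut})")
```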

ADD REPLY
1
Entering edit mode

Yes, sorry, I see your argument now and it's a good one. GATK isn't the only variant caller out there, and the other ones you suggest are also very good - particularly VarScan2 if you have 1 sample at high depth.

But still, you know, it's important to remember that variant callers are not totally independent of one another in the results they produce. Say there are 1000 true variants in a sample. Program 1 gives you 400, plus 50 false positives. Program 2 gives you 450, plus 90 false positives. The problem with combining the outputs (in any way) is that, depending on how similarly the two programs work, the intersection could contain anything from 400 true positives and 0 false positives to 0 true positives and 50 false positives. OK, that's a little extreme, but what is undoubtedly true is that you started with two programs which individually had a well-defined false discovery rate (because they had been tested rigorously), and after combining the outputs you no longer really know what you're dealing with. The more tests you combine, the more uncertain it becomes - rather than, as people intuitively assume, all the programs working together to converge on some "high confidence" variant call set. For this reason, you are best off sticking to one program that you can say works well on your data for reasons XYZ, and keeping it simple :)
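To make those numbers concrete, here is a small back-of-the-envelope calculation using the hypothetical counts above (1000 true variants; caller 1: 400 TP / 50 FP, caller 2: 450 TP / 90 FP):

```python
# Hypothetical counts from the example above.
tp1, fp1 = 400, 50      # caller 1: true and false positives
tp2, fp2 = 450, 90      # caller 2: true and false positives

# Best case for the intersection: the callers agree on true variants and
# disagree on their mistakes -> up to min(tp1, tp2) TPs and 0 FPs.
best_tp, best_fp = min(tp1, tp2), 0

# Worst case: they share only false positives (e.g. both fooled by the same
# alignment artefacts) -> 0 TPs and up to min(fp1, fp2) FPs.
worst_tp, worst_fp = 0, min(fp1, fp2)

for label, tp, fp in [("best", best_tp, best_fp), ("worst", worst_tp, worst_fp)]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    print(f"{label} case: {tp} TP, {fp} FP, precision {precision:.2f}")

# The point: without knowing how correlated the two callers' errors are,
# the intersection's false discovery rate can land anywhere in this range.
```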

This doesn't apply to things like mapping, where the unmapped output of one program can be fed into another, but it does apply in most situations where people want to chain multiple programs together into some sort of Captain Planet variant/peak/gene caller.

ADD REPLY
0
Entering edit mode

Yes, I agree with the point you mention. In fact I am also not much in favour of combining tests, since the level of uncertainty increases and it becomes difficult to test the significance of the combined call sets. I have heard of combined p-value assessments, but I am no expert in that, so I cannot argue the point. I usually stick to either VarScan2 or MuTect2, depending on my question of interest. If I am interested in top variants, I can assess the distribution of mutation frequencies, restrict my variant set to a stricter one, and then see whether the combined calls fish out those high-confidence variants or not. At times I go with MuTect2 when I am interested in low-frequency mutations, which is not usually possible with VarScan2 since its approach is heuristic.

I am not saying GATK does not work well; it is definitely one of the standard tools for large-scale variation analysis. But for my cases, where the sample size does not exceed 10 or 15 (and depth is at most 70x), I have always found the other two callers more effective. I usually feed GATK-processed alignment files to the mutation callers, to benefit from the refinement GATK does on false-positive matches and mismatches that can occur around indels.

If the OP wants to stick to GATK only, then they definitely need to look at the call-set distributions of coverage, mutation frequency and quality scores; that can give much higher-confidence calls. GATK calls may not be very strict unless you test all the parameters independently, which might not be what the project is about and can turn into another big project on its own. If one wants to look at variation without relying on a single tool, one can try different callers, look for high-confidence variants and, if required, assess the combined variants; alternatively, if the top variants from any one tool include a well-annotated feature that is actually detrimental to the phenotype, that can be carried forward for testing. GATK itself recommends designing your own filtering strategy for calling variants, and I wonder how few papers actually test this for each caller at different thresholds; most compare callers at default settings to claim which one has higher sensitivity and specificity. Having said that, I am happy to learn more from your inputs. :)
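On looking at those distributions before picking thresholds: a quick sketch that summarises QUAL and DP quantiles from a hypothetical single-caller VCF, so cutoffs can be chosen from the data rather than taken from defaults (field names and file name are assumptions):

```python
import gzip
import statistics

def qual_and_depth(vcf_path):
    """Yield (QUAL, DP) per record; DP is taken from the INFO field if present."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split("\t")
            qual = float(fields[5]) if fields[5] != "." else 0.0
            dp = 0.0
            for item in fields[7].split(";"):
                if item.startswith("DP="):
                    dp = float(item[3:])
            yield qual, dp

quals, depths = zip(*qual_and_depth("caller.vcf.gz"))   # hypothetical input
for name, values in (("QUAL", quals), ("DP", depths)):
    q = statistics.quantiles(values, n=4)               # quartiles
    print(f"{name}: min={min(values):.1f} q1={q[0]:.1f} "
          f"median={q[1]:.1f} q3={q[2]:.1f} max={max(values):.1f}")
```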

ADD REPLY
