Question: What Do You Expect As The False Positive And Negative Rate For SNPs And Indels In WGS?
5.0 years ago by
William 4.2k wrote:

What do you expect as the false positive and false negative rates for SNPs and indels in a WGS experiment?

On which papers and data sets (in-house or external) do you base this?

Edit: Of course this depends on a lot of things, as mentioned in the comment below. So let's assume a very vanilla situation: human genome or a popular model organism with a relatively good assembly, a relatively closely related sample, maybe excluding difficult-to-sequence or difficult-to-map regions, Illumina 100 x 100 reads, sequenced at 30x, mapped with BWA, SNPs and indels called with GATK.

Won't the answer depend heavily on species, genetic closeness to the reference sequence, type of sequencing, depth of sequencing, SNP/indel filtering, SNP/indel calling method, and probably a few more things that I haven't listed?

5.0 years ago by
University Park, USA
Istvan Albert ♦♦ 77k wrote:

See Brad Chapman's blog, Blue Collar Bioinformatics, for articles like this one:

Framework for evaluating variant detection methods: comparison of aligners and callers

There is an entire series on this subject.

Again, I can't seem to find the exact method they used to calculate the concordance between indel calls.

5.0 years ago by
New York
Vitis 1.6k wrote:

O'Rawe et al. 2013 is an excellent paper describing the concordance of indel calls from different variant callers. I think it includes some details about how they 'aligned' the indel calls to make them comparable, because it is usually not trivial to correct and compare indel start/end sites from different callers. The main message of the paper is that concordance is low: different callers usually produce their own unique indel calls, and the overlap among callers is not as good as we would like to see. The link to the paper is:

"For indel calls, initial agreement between SOAPindel, SAMtools and GATK was very low at 3.0% (see Additional file 1, Figure S8). Indel coordinates were subsequently left-normalized and intervalized using a total range of 20 genomic coordinates (10 bp in each direction of their genomic coordinates)"
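To make the "intervalized" comparison in that quote concrete, here is a rough Python sketch. It is not the paper's code. It assumes the indels are already left-normalized and reduced to (chromosome, position) pairs, pads each position by 10 bp on either side, and counts a call as shared when its padded interval overlaps a padded interval from every other caller. The function names, the overlap rule and the toy coordinates are all made up for illustration.

```python
from typing import Dict, List, Tuple

Indel = Tuple[str, int]          # (chromosome, left-normalized position); assumed input form
Interval = Tuple[str, int, int]  # (chromosome, start, end), both ends inclusive


def intervalize(indel: Indel, pad: int = 10) -> Interval:
    """Pad a position by `pad` bp in each direction (hypothetical helper)."""
    chrom, pos = indel
    return (chrom, max(1, pos - pad), pos + pad)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def shared_fraction(calls: Dict[str, List[Indel]], base: str, pad: int = 10) -> float:
    """Fraction of the `base` caller's indels that overlap a call from every other caller."""
    base_intervals = [intervalize(i, pad) for i in calls[base]]
    if not base_intervals:
        return 0.0
    others = [
        [intervalize(i, pad) for i in indels]
        for name, indels in calls.items()
        if name != base
    ]
    shared = sum(
        all(any(overlaps(iv, o) for o in other) for other in others)
        for iv in base_intervals
    )
    return shared / len(base_intervals)


# Toy call sets with invented coordinates, only to show the mechanics.
calls = {
    "caller_a": [("chr1", 100), ("chr1", 500), ("chr2", 40)],
    "caller_b": [("chr1", 105), ("chr2", 300)],
    "caller_c": [("chr1", 95), ("chr1", 505), ("chr2", 45)],
}
print(shared_fraction(calls, "caller_a"))
```

In this toy example only the chr1:100 call from caller_a has an overlapping call in both other sets, so one of its three calls counts as shared. Note that requiring two padded intervals to overlap lets positions up to 20 bp apart match. If you would rather require positions within 10 bp of each other, compare the positions directly instead.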

