6.2 years ago by
I have got the following email. I will put the answer here.
I have been using both GATK and Samtools for variant calling in
individual samples. Both these tools uses a bayesian approach to call
the variants but still produces slightly different number of variants.
Is the difference between the two algorithms arise because of prior
probabilities or likelihood calculation they take into account to
calculate the posterior probability or any other cutoff values used in
I have read the following papers to figure out the difference between
the algorithms. Unfortunately i am only able to find out the common
methods carried by both the algorithms but not the difference leading
to different number of SNP calls between the two.
Framework for variation discovery and genotyping from
next-generation DNA sequencing (GATK).
Mapping short DNA sequencing reads and calling variants using
mapping quality scores (Samtools)
The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data
Preprocess of alignment. GATK drops reads with low mapping quality (if I am right), but samtools uses all reads by default. Samtools caps base quality by alignment quality. GATK did not do this at least initially. Mark and I used to talk about this. I do not know if GATK does the same as samtools nowadays. GATK also collapses overlapping ends in a read pair. This is the right thing to do. However, my impression is this will not greatly improve the accuracy from their analysis, so I have not implemented this bit in samtools. Furthermore, samtools uses BAQ by default. GATK has the BAQ feature, but probably not using it by default.
SNP genotype likelihood model. GATK assumes sequencing errors are independent while samtools believes the second error comes at a higher chance. This difference has little effect unless the depth is beyond hundreds of folds.
Indel genotype likelihood model. This is one of the major differences between samtools and GATK. Samtools' model is derived from BAQ. It employs quite a few ad hoc heuristic to identify alleles and so on. GATK's model was derived from Dindel's model initially, but the model has been subjected to various changes with time. I don't know how exactly it works now.
Variant calling model. The model behind samtools and GATK is nearly identical. This model was first proposed by Rasmum Nielsen and Richard Durbin independently and then more formally described in my samtools methodology paper. Before the publication, GATK has also implemented it using identical formulae. GATK later augmented the model to work with multi-allelic cases. Samtools does not include that bit.
Filtering. Another major difference. Samtools uses hand-tuned filters, while GATK learns filters from data. Of course GATK's approach is more convenient and powerful at least for human variant calling where you have enough data to train the model and you do not need non-polymorphic sites. In addition, for effective filtering, GATK collects more statistic about the alignment (e.g. haplotype score) than samtools. They are quite useful for the diagnostic purposes.
Other GATK specific features: haplotype caller, indel realignment among many others. Haplotype caller is the most important feature in GATKv2. It has been shown to have much higher sensitivity to longer indels.
Other samtools specific features: genotype-free analysis, physical phasing and so on.
Technical difference. GATK is clearly the more sophisticated and the more complicated. Nonetheless, samtools is equally scalable to huge data sets in spite of its much simpler framework.
GATK and samtools are similar overall, but those who have implemented a large project like samtools or GATK would know that subtle details, such as the thresholds in use, heuristic employed to reduce complexity, tricks for better performance and so on, can be very different, which is likely to lead to observable difference in the final results. In general, samtools focuses more on a multi-purpose tool. It tends to use filters producing less bias. It produces the consensus in a more uniform (and the right) way and has the ability to infers population parameters not doable with GATK. I would say GATK focuses more on variant calling for human data. It is indeed a better (I would not say not much better) variant caller than samtools for 1000g types of data, largely thanks to its advanced filtering and the bulk of training data available to human. On non-human data and other data types, probably they differ little in SNP accuracy.
6.2 years ago by
lh3 ♦ 31k