Question

Difference Between Samtools And Gatk Algorithms

9

Entering edit mode

11.4 years ago

hellbio ▴ 520

Hi all,

I am interested in finding the difference in algorithms used by the tools gatk and samtools.Although both of them use a bayesian model for genotype and SNP calling, the results are slightly different, as they produce different number of variants. Does anyone know, where this difference comes from?

gatk samtools • 27k views

ADD COMMENT • link updated 11.4 years ago by lh3 33k • written 11.4 years ago by hellbio ▴ 520

score 42 · Answer 1 · 2012-11-20

I have got the following email. I will put the answer here.

I have been using both GATK and Samtools for variant calling in individual samples. Both these tools uses a bayesian approach to call the variants but still produces slightly different number of variants. Is the difference between the two algorithms arise because of prior probabilities or likelihood calculation they take into account to calculate the posterior probability or any other cutoff values used in the algorithm?

I have read the following papers to figure out the difference between the algorithms. Unfortunately i am only able to find out the common methods carried by both the algorithms but not the difference leading to different number of SNP calls between the two.

Framework for variation discovery and genotyping from next-generation DNA sequencing (GATK).

Mapping short DNA sequencing reads and calling variants using mapping quality scores (Samtools)

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data

Preprocess of alignment. GATK drops reads with low mapping quality (if I am right), but samtools uses all reads by default. Samtools caps base quality by alignment quality. GATK did not do this at least initially. Mark and I used to talk about this. I do not know if GATK does the same as samtools nowadays. GATK also collapses overlapping ends in a read pair. This is the right thing to do. However, my impression is this will not greatly improve the accuracy from their analysis, so I have not implemented this bit in samtools. Furthermore, samtools uses BAQ by default. GATK has the BAQ feature, but probably not using it by default.
SNP genotype likelihood model. GATK assumes sequencing errors are independent while samtools believes the second error comes at a higher chance. This difference has little effect unless the depth is beyond hundreds of folds.
Indel genotype likelihood model. This is one of the major differences between samtools and GATK. Samtools' model is derived from BAQ. It employs quite a few ad hoc heuristic to identify alleles and so on. GATK's model was derived from Dindel's model initially, but the model has been subjected to various changes with time. I don't know how exactly it works now.
Variant calling model. The model behind samtools and GATK is nearly identical. This model was first proposed by Rasmum Nielsen and Richard Durbin independently and then more formally described in my samtools methodology paper. Before the publication, GATK has also implemented it using identical formulae. GATK later augmented the model to work with multi-allelic cases. Samtools does not include that bit.
Filtering. Another major difference. Samtools uses hand-tuned filters, while GATK learns filters from data. Of course GATK's approach is more convenient and powerful at least for human variant calling where you have enough data to train the model and you do not need non-polymorphic sites. In addition, for effective filtering, GATK collects more statistic about the alignment (e.g. haplotype score) than samtools. They are quite useful for the diagnostic purposes.
Other GATK specific features: haplotype caller, indel realignment among many others. Haplotype caller is the most important feature in GATKv2. It has been shown to have much higher sensitivity to longer indels.
Other samtools specific features: genotype-free analysis, physical phasing and so on.
Technical difference. GATK is clearly the more sophisticated and the more complicated. Nonetheless, samtools is equally scalable to huge data sets in spite of its much simpler framework.

GATK and samtools are similar overall, but those who have implemented a large project like samtools or GATK would know that subtle details, such as the thresholds in use, heuristic employed to reduce complexity, tricks for better performance and so on, can be very different, which is likely to lead to observable difference in the final results. In general, samtools focuses more on a multi-purpose tool. It tends to use filters producing less bias. It produces the consensus in a more uniform (and the right) way and has the ability to infers population parameters not doable with GATK. I would say GATK focuses more on variant calling for human data. It is indeed a better (I would not say not much better) variant caller than samtools for 1000g types of data, largely thanks to its advanced filtering and the bulk of training data available to human. On non-human data and other data types, probably they differ little in SNP accuracy.

score 5 · Answer 2 · 2012-11-16

5

Entering edit mode

11.4 years ago

Istvan Albert 100k

Here is a nice summary from Heng Li on some differences between the two (even though the original question was on a different topic):

A: genotype likelihood calculation in samtools is wrong?

ADD COMMENT • link 11.4 years ago by Istvan Albert 100k

0

Entering edit mode

Well, there have been several sources which are trying to describe the Bayesian genotype models used in both GATK and samtools. Could someone give a source of finding the SNP calling models used in both these algorithm?

ADD REPLY • link 11.4 years ago by hellbio ▴ 520