Question: How Is Heterozygosity Determined In A Consensus Fasta?
0
5.8 years ago by
Justin440
United States
Justin440 wrote:

I have a consensus fasta file generated by samtools/bcftools/vcfutils as discussed in a previous post here: How to generate a consensus fasta sequence from SAM tools pileup?

In some sites, I see e.g. R, which means that site was heterozygous A/G.

How does it determine it's heterozygous? E.g. if you see 50 A's and 39 G's, how do you know it's a het? By using a minor allele % threshold? Or using a probability model? If it uses probability, are there any references out there that describe the math?

consensus samtools • 1.9k views
modified 5.8 years ago by Sean Davis25k • written 5.8 years ago by Justin440
1
5.8 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ#How_are_SNPs_and_indels_called_and_filtered_by_SAMtools.3F

In short, the MAQ paper is where the method is described:

http://www.ncbi.nlm.nih.gov/pubmed/18714091

Thanks!

For those interested in the math, see the heading "Consensus genotype calling" in the "Methods" section of the paper.

Basically, it's a Bayesian probability model where you want P(genotype g | observed data D), which you can get from Bayes' theorem if you have P(D | g), which the paper explains how to get. Then you estimate g as g* = argmax[g] P(g | D)