I am looking at genotype data for pooled DNA with a sample size of 10. The data is in dibayes format, and for each SNP it gives the coverage, the number of times the reference allele was counted, and the number of times any other allele was counted.
As an example, one SNP had 7x coverage: the minor allele was seen 1 out of 7 times and the major allele the other 6.
How do you work out the minor allele frequency for this population? There are 20 chromosomes present but we've only seen 35% of them (7 out of 20), so we can't simply say the MAF is 14%.
Also, other SNPs may have higher or lower coverage, and I presume you need to account for that somehow too.
I would ultimately like to create a simple allele frequency spectrum. I've had a look for some information on this, but all the papers I have seen are way too complicated for what I need. Can anyone recommend a basic introduction to this analysis?
thanks a lot
If your intention is to do population statistics, you will have to work at the sample level, not at the read (coverage) level. The MAF is the frequency of the allele that appears less often across the samples, and that has nothing to do with coverage. In fact, coverage only helps you with SNP calling; once the SNPs are called, its job is done.
There aren't many meaningful statistics you can do with only 10 samples, but you can try the following measurements: allele frequency (self-explanatory), heterozygosity (for each SNP, the ratio heteros/(heteros+homos)), or even local inbreeding (Fs). You won't be able to calculate other population statistics indices such as Fst or In, because these measure distances between populations, not within a population.
I cannot think of any better reading than basic population genetics textbooks (such as "Principles of Population Genetics", Hartl 1997, Sinauer Associates, or "Population Genetics, a concise guide", Gillespie 1998, Johns Hopkins University Press), but for understanding F-statistics I've always recommended following this worked example by Dr. David McDonald.
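As a rough illustration of the sample-level measurements mentioned above, here is a minimal Python sketch. It assumes you already have one called diploid genotype per sample at a biallelic SNP, which pooled data does not give you directly. The function name `site_stats` and the two-letter genotype strings are made up for this example.

```python
from collections import Counter

def site_stats(genotypes):
    """Allele frequencies, MAF and observed heterozygosity at one SNP.

    genotypes: list of 2-character strings such as "AG", one per diploid
    sample (hypothetical input format, assumed biallelic).
    """
    # Count every allele copy: each sample contributes two chromosomes.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    total = sum(allele_counts.values())
    freqs = {allele: n / total for allele, n in allele_counts.items()}
    # With a single allele observed the site is monomorphic, so MAF is 0.
    maf = min(freqs.values()) if len(freqs) > 1 else 0.0
    # Observed heterozygosity: heteros / (heteros + homos).
    het = sum(1 for g in genotypes if g[0] != g[1]) / len(genotypes)
    return freqs, maf, het

# Ten samples = 20 chromosomes: 6 AA, 3 AG, 1 GG.
example_genotypes = ["AA"] * 6 + ["AG"] * 3 + ["GG"]
example_freqs, example_maf, example_het = site_stats(example_genotypes)
```

Here G is carried by 5 of the 20 chromosomes, and 3 of the 10 samples are heterozygous.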
Here is a statistical analysis paper, and a software package for this with a funny name: PoPoolation.
And here is another paper that discusses optimal pooling strategies for NGS.
N.B: I have not read either at this point but have an interest in the methodology and application. Hope this helps.
I like Jorge's answer here (and his others on SNPs) very much. Think of it this way. If the major allele is found 6 times but from only 3 chromosomes (some chromosomes were read by the sequencing machine more than once) and the minor allele is found once, then the MAF is ~25% (1/4). And if the major allele is found 6 times from 6 different chromosomes and the minor allele is again found once, then the MAF is ~14% (1/7). This illustrates what Jorge wrote about the difference between sampling read data and sampling individual chromosomes/individuals.
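One way to make the read-versus-chromosome distinction concrete is to ask how likely the observed reads are for each possible number of minor-allele chromosomes in the pool. The sketch below is not a method from any of the answers here. It assumes a plain binomial model: every chromosome contributes equally to the pool, reads are drawn independently, and there are no sequencing errors. The function name is made up.

```python
from math import comb

def likelihood_of_minor_count(minor_reads, total_reads, n_chromosomes):
    """Binomial likelihood of the read counts for each possible number k
    of minor-allele chromosomes in the pool (k = 0 .. n_chromosomes).

    Assumes equal contribution per chromosome and no sequencing error.
    """
    likelihoods = {}
    for k in range(n_chromosomes + 1):
        p = k / n_chromosomes  # pool allele frequency if k copies are minor
        likelihoods[k] = (comb(total_reads, minor_reads)
                          * p ** minor_reads
                          * (1 - p) ** (total_reads - minor_reads))
    return likelihoods

# The SNP from the question: 1 minor read out of 7, pool of 20 chromosomes.
lik = likelihood_of_minor_count(1, 7, 20)
best_k = max(lik, key=lik.get)
```

Under these assumptions the most likely value is 3 minor chromosomes out of 20 (15%), but neighbouring values such as 2 or 4 are almost as likely, which shows how little a single 7x site pins down.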
I was editing my previous answer, and then I realized how long it had become, so I decided to open a new answer since it now covers pooled DNA appropriately (although unfortunately it doesn't completely solve your problem). I should have done it before, but I guess I didn't get the point at first. Here we go then...
MAF, in essence, measures how probable it would be to find a certain allele in a population. Calculating it by directly sampling individuals is straightforward, but I guess that with pooled DNA some further statistics are needed. Unfortunately I haven't done any work on that, and maybe Larry's idea is enough, but I guess some further reading may be appropriate. I just followed PMID:16643673 and discovered 2 papers (PMID:15677751 and PMID:11140947) that describe methods for calculating allele frequencies on pooled DNA. Also, the materials and methods section of this paper seems to point out the appropriate statistics to use in order to obtain allele frequencies from pooled DNA.
Having said all this, I must say that pooled DNA techniques have been studied for years in Sanger sequencing and genotyping, but not that much with NGS as far as I know. One major problem you may find if you rely on counting NGS reads is that you will have to consider heterozygosity, and you will have to know that an allele in heterozygosis will always be under-represented with NGS techniques. I really don't know if you can trust NGS data alone to calculate allele frequencies, at least through such a direct method as counting read differences, since there are several steps in the mapping and SNP calling process that may introduce biases. I would encourage other BioStar readers to share any publication that may have covered this issue in particular: dealing with allele frequencies from pooled DNA on NGS.
If there were no sequencing errors, base counting would be an unbiased estimator of site allele frequency. When there are sequencing errors, I am not aware of any simple estimators that are good enough. The two papers pointed out by jvijai are good in theory, but I doubt their usefulness in practice. The first paper aims at variant discovery rather than at a good estimator of frequency. The second paper seems to assume accurate base quality, which is rarely the case.
As Jorge has pointed out, for 10 samples, the best way is to barcode them. In my opinion, the additional cost of barcoding is minor in comparison to what you gain. With barcoding, the estimates can be much better.
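To see why errors matter most at low frequencies, here is a small sketch of the expected read fraction under a deliberately simplified error model. It assumes two alleles and a single symmetric error rate at which a read of one allele is reported as the other. Real errors depend on base quality and context, and the function name is made up.

```python
def expected_observed_fraction(true_freq, error_rate):
    """Expected fraction of reads showing the minor allele.

    Simplifying assumption: two alleles, and each read is flipped to the
    other allele with probability error_rate, independently of everything.
    """
    # True minor reads that survive, plus major reads miscalled as minor.
    return true_freq * (1 - error_rate) + (1 - true_freq) * error_rate

no_error = expected_observed_fraction(0.15, 0.0)         # equals the true frequency
monomorphic = expected_observed_fraction(0.0, 0.01)      # looks like a 1% variant
common = expected_observed_fraction(0.5, 0.01)           # unaffected at 50%
```

With no errors the expectation equals the true frequency, which is the unbiasedness mentioned above. With errors, a site with no variation at all still shows a small non-zero fraction.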
If you are aiming at something simple with your current data, I would probably discard bases with low base or mapping quality and do base counts. The spectrum at f=0 is rubbish, but the density conditioned on f>0 should be about right.
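A minimal Python sketch of that simple approach might look like the following. It does not parse dibayes output. It assumes you have already extracted, for each site, the reference base and a list of (base, base quality, mapping quality) tuples. The function names, the quality cutoffs of 20 and the 10 bins are arbitrary choices for illustration.

```python
def nonref_fraction(ref, observations, min_baseq=20, min_mapq=20):
    """Fraction of non-reference bases at one site after quality filtering.

    observations: list of (base, base_quality, mapping_quality) tuples
    (hypothetical input layout). Returns None if no base passes the filter.
    """
    kept = [base for base, baseq, mapq in observations
            if baseq >= min_baseq and mapq >= min_mapq]
    if not kept:
        return None
    return sum(1 for base in kept if base != ref) / len(kept)

def spectrum(fractions, n_bins=10):
    """Histogram of per-site frequencies, conditioned on f > 0."""
    bins = [0] * n_bins
    for f in fractions:
        if f is None or f == 0:
            continue  # skip uncovered sites and the unreliable f = 0 class
        # f == 1.0 is placed in the last bin rather than past the end.
        bins[min(int(f * n_bins), n_bins - 1)] += 1
    return bins

# Six good reference reads, one good alternate read, one low-quality read.
site = [("A", 30, 60)] * 6 + [("G", 30, 60)] + [("T", 5, 60)]
site_fraction = nonref_fraction("A", site)
example_bins = spectrum([site_fraction, 0, None, 0.5, 1.0])
```

Note that this bins the non-reference fraction, not the minor allele frequency. If you want a folded spectrum, you could bin min(f, 1 - f) instead.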