How to polarize ancestral versus derived alleles?
6.4 years ago
Alice ▴ 320

Hello biostars,

I am getting into popgen analysis in ANGSD and am currently working on some summary statistics. This made me think a bit more about allele polarization for D-statistics, f3 and f4 statistics, the SFS, and other analyses.

To estimate the D-statistic, ANGSD asks for an ancestral FASTA file. However, I am not sure what kind of FASTA it should be. If all my BAMs are aligned to, let's say, hg19, but the outgroup is chimp, should I provide the PanTro reference genome? In that case the coordinates would differ, since the BAMs are aligned to hg19.

Or should I convert a PanTro BAM file aligned to hg19 into some kind of consensus FASTA? Or, finally, I could realign all BAMs to the chimp genome and then use these realigned BAMs together with PanTro for the analysis. What is the best way to do this?
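To illustrate why the ancestral sequence needs to share coordinates with the BAMs, here is a minimal sketch of how a D-statistic uses the outgroup base at each site. This is not ANGSD's implementation: the function names are hypothetical, each sample is reduced to a single base per site, and the sign convention (ABBA minus BABA) is an assumption that varies between tools.

```python
def site_pattern(h1, h2, h3, outgroup):
    """Classify one site as 'ABBA', 'BABA', or None.

    The outgroup base defines the ancestral state 'A', so it must come
    from the same genomic position as the h1, h2 and h3 bases.
    """
    if len({h1, h2, h3, outgroup}) != 2:
        return None  # keep only biallelic sites
    if h1 == outgroup and h2 == h3 and h2 != outgroup:
        return "ABBA"  # h2 and h3 share the derived allele
    if h2 == outgroup and h1 == h3 and h1 != outgroup:
        return "BABA"  # h1 and h3 share the derived allele
    return None  # any other pattern is uninformative for D


def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA) over (h1, h2, h3, outgroup) tuples."""
    abba = baba = 0
    for h1, h2, h3, outgroup in sites:
        pattern = site_pattern(h1, h2, h3, outgroup)
        if pattern == "ABBA":
            abba += 1
        elif pattern == "BABA":
            baba += 1
    total = abba + baba
    return None if total == 0 else (abba - baba) / total
```

If the outgroup base were read from a different coordinate system, the ancestral call at each site would effectively be random with respect to the samples, and the ABBA and BABA counts would be meaningless.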

I guess it would be better to use a real outgroup to polarize alleles (especially for the SFS), but some papers (such as this one) use a non-outgroup reference and work with the folded SFS instead, with no problems.
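For reference, folding just means collapsing derived-allele counts into minor-allele counts, which is why it does not depend on knowing the ancestral state. A minimal sketch, assuming the unfolded SFS is a plain list indexed by derived allele count from 0 to n (the function name `fold_sfs` is hypothetical):

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS into a minor-allele SFS.

    unfolded[i] is the number of sites with i derived alleles out of
    n = len(unfolded) - 1 sampled chromosomes. Bin i and bin n - i are
    merged because, without polarization, they cannot be told apart.
    """
    n = len(unfolded) - 1
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        if i == j:
            folded.append(unfolded[i])  # middle bin when n is even
        else:
            folded.append(unfolded[i] + unfolded[j])
    return folded


print(fold_sfs([10, 5, 3, 2, 1]))  # [11, 7, 3]
```

Swapping ancestral and derived labels at any site only moves it between bins that get merged anyway, so mispolarization does not affect the folded spectrum.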

In general, is there an optimal strategy for this kind of popgen decision making?

Apologies that no-one else has responded. It is a very specific type of analysis that you are aiming to do, but very interesting I must admit.

From what I can see, ANGSD could accept one BAM aligned to hg19 and another aligned to Pan troglodytes; however, this may not necessarily be the correct way to run the program.

I noticed this recent study, which appeared to run ANGSD separately on 3 different species:

Thanks! Yeah, it is not an easy question. I ended up aligning chimp to hg19. The other part of my question is very theoretical. I looked through the literature to see what people do, and they do whatever the data allow. Some, for example, do not have a sequenced outgroup, so they just use the reference.

Hi Alice, I am computing an unfolded SFS, but I don't know how to use a real outgroup to polarize alleles. Can you give me some suggestions?
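Conceptually, polarizing with an outgroup means looking up the outgroup base at each variable site (in the same coordinates as the samples, e.g. chimp aligned to hg19 as described above) and calling the matching allele ancestral. A rough sketch of that logic, not tied to ANGSD; the function names and the per-site inputs are assumptions for illustration:

```python
def polarize(ref, alt, outgroup):
    """Return (ancestral, derived) for a biallelic site, or None.

    The site cannot be polarized when the outgroup base is missing
    (e.g. 'N') or matches neither allele.
    """
    out = outgroup.upper()
    if out == ref.upper():
        return ref, alt
    if out == alt.upper():
        return alt, ref
    return None


def derived_count(ref, alt, outgroup, alt_count, n_chrom):
    """Convert an alt-allele count into a derived-allele count, or None."""
    polarized = polarize(ref, alt, outgroup)
    if polarized is None:
        return None  # drop the site from the unfolded SFS
    _, derived = polarized
    return alt_count if derived == alt else n_chrom - alt_count


print(derived_count("A", "G", "G", 3, 10))  # 7: the reference allele is derived here
```

Sites where the outgroup is missing or carries a third allele are simply skipped in this sketch; a single outgroup can also be wrong because of recurrent mutation, which is one reason some studies fall back on the folded SFS.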


