I am getting into popgen analysis in ANGSD and am currently working on some summary statistics. This made me think a bit more about allele polarization for the D-statistic, f3 and f4 statistics, the SFS, and other analyses.
To estimate the D-statistic, ANGSD asks for an ancestral FASTA file. However, I am not sure what kind of FASTA it should be. If all my BAMs are aligned to, let's say, hg19, but the outgroup is chimp, should I provide the PanTro reference genome? In that case the coordinates would not match, because the BAMs are aligned to hg19.
Or should I convert a PanTro BAM file aligned to hg19 into some kind of consensus FASTA? Or, finally, I could realign all the BAMs to the chimp genome and then use these realigned BAMs together with the PanTro reference for the analysis. What is the best way to do this?
I guess it would be better to use a real outgroup to polarize alleles (especially for the SFS), but some papers (such as this one) use a non-outgroup reference and get around the polarization problem by working with the folded SFS, apparently with no problems.
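To make sure I understand folding correctly, here is a minimal Python sketch of what I think it does. The function name `fold_sfs` and the example counts are made up for illustration, and this is not ANGSD's own code: the unfolded SFS has one bin per derived-allele count from 0 to n, and folding adds bin i to bin n - i, so it no longer matters which allele is ancestral.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS (bins 0..n derived alleles) into minor-allele bins 0..n//2."""
    n = len(unfolded) - 1  # number of sampled chromosomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i  # the bin that mirrors bin i
        # The middle bin (i == j, only when n is even) has no mirror, so keep it once.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


# Hypothetical site counts for n = 4 chromosomes (bins 0..4).
example = [10, 5, 3, 2, 1]
folded_example = fold_sfs(example)
print(folded_example)
```

If that is right, the total number of sites is unchanged by folding and only the ancestral/derived labelling is lost.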
In general, is there an optimal strategy for this kind of popgen decision making?