I am currently carrying out an RNA-Seq analysis with more than 100 samples from two natural populations (wild and domesticated). After calling SNPs, I want to compute some population genetics statistics gene by gene, such as Ka, Ks, the Ka/Ks ratio, Pi, Fst, Theta, etc. I am unsure how to deal with heterozygous SNPs and missing data.
1. For each gene or SNP locus, should I calculate the statistic (such as Pi, Theta, Fst, etc.) for each pair of samples within a sub-population and then take the average for that gene or SNP locus? I think so, but I am not sure. Or is it enough to calculate the statistic between the reference and the consensus SNP for each sub-population?
2. Should heterozygous SNP loci be treated as missing data and left out of the calculation? If I leave them out, very few homozygous SNPs remain. How many samples have to be heterozygous at a locus before we should discard it? Or should we use the hapmap information to calculate it? For diploid SNPs, could we treat each sample as two samples, so that a homozygous SNP gives two identical samples and a heterozygous SNP gives two different samples, and then do the pairwise comparisons within the sub-population?
3. How should missing data be handled?
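To make the "one diploid sample as two samples" idea concrete, here is a minimal Python sketch of what I have in mind for a single SNP site. It is only my assumption of how this could work, not an established method: the function names `split_genotype` and `site_pi` are made up, genotypes are assumed to be strings like `A/G`, and missing calls (assumed to look like `./.`) are simply skipped rather than imputed.

```python
from itertools import combinations

# Assumption: these strings mark a missing allele call.
MISSING = {".", "N", ""}

def split_genotype(gt):
    """Split a diploid genotype such as 'A/G' or 'A|G' into two alleles.

    Returns an empty list when the genotype is missing or malformed,
    so that sample contributes nothing at this site.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a in MISSING for a in alleles):
        return []
    return alleles

def site_pi(genotypes):
    """Average pairwise difference among all sampled alleles at one site.

    Homozygous samples add two identical alleles, heterozygous samples
    add two different alleles. Returns None if fewer than two alleles
    were called at the site.
    """
    alleles = [a for gt in genotypes for a in split_genotype(gt)]
    if len(alleles) < 2:
        return None
    pairs = list(combinations(alleles, 2))
    differing = sum(1 for a, b in pairs if a != b)
    return differing / len(pairs)

# One sub-population at one SNP: three called samples, one missing.
print(site_pi(["A/A", "A/G", "G/G", "./."]))  # 0.6
```

In this sketch the missing sample just reduces the number of alleles compared at that site. I do not know whether that is the accepted way to treat missing data, or whether sites with too few called samples should be dropped entirely.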
Could somebody give me some guidance on these points, or share a script for calculating these values from SNP loci?