I am interested in calculating per-site dN and dS values across all of the sites in a genome (including both coding and non-coding regions), using multiple sequences from each species in a highly genetically conserved genus of viruses. The challenge is that some species are highly under-represented in the sequences I have available, whereas one species comprises over 50% of the available data. While the sequences are very highly conserved overall (especially in the coding regions), there are also many poorly conserved regions. The over-representation of one species means that the more divergent a species is, the more poorly it is represented in the data set. As a result, the MSAs seem to be highly sensitive to local optima and prone to being heavily gapped when aligning longer sequences.
Rather than relying on a single alignment to score a given codon site (the nth codon position), what if I were to align overlapping sliding windows separately and then average the values that each window yields for a given position?
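To make the idea concrete, here is a minimal sketch of the averaging step only. It assumes that windows are defined in the codon coordinates of a single reference sequence, and that `estimate_window` is a hypothetical callback (not part of any real library) that aligns the sequences for one window and returns one value per codon site in that window, or `None` where no estimate is available. The names `sliding_windows` and `average_per_site` are likewise illustrative.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def sliding_windows(n_sites: int, window: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) codon ranges that slide across 0..n_sites."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while True:
        end = min(start + window, n_sites)  # the last window may be shorter
        yield start, end
        if end >= n_sites:
            break
        start += step


def average_per_site(
    n_sites: int,
    window: int,
    step: int,
    estimate_window: Callable[[int, int], List[Optional[float]]],
) -> List[Optional[float]]:
    """Average the per-site values over every window that covers each site.

    estimate_window(start, end) is a hypothetical placeholder for
    "align this window and estimate a value per codon site".
    """
    sums = [0.0] * n_sites
    counts = [0] * n_sites
    for start, end in sliding_windows(n_sites, window, step):
        values = estimate_window(start, end)
        for offset, value in enumerate(values):
            if value is None:  # no estimate for this site in this window
                continue
            sums[start + offset] += value
            counts[start + offset] += 1
    # Sites never covered (or never estimated) stay None rather than 0.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy usage: each window reports its own start index for every site it covers.
toy = average_per_site(6, window=4, step=2,
                       estimate_window=lambda s, e: [float(s)] * (e - s))
```

With this scheme, sites near the ends of the genome are covered by fewer windows than sites in the middle, so their averages rest on fewer estimates. If the quantity of interest is a ratio, it may also be worth averaging dN and dS separately rather than averaging the ratio itself.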