I am interested in calculating per-site dN and dS values across all of the sites in a genome (including both coding and non-coding regions), using multiple sequences from each species in a highly genetically conserved genus of viruses. The challenge is that some species are heavily under-represented in the sequences I have available, whereas one species comprises over 50% of the available data. While the sequences are very highly conserved overall (especially in the coding regions), there are also many regions that are more poorly conserved. The over-representation of one species means that the more divergent a species is, the more poorly it is represented in the data set. As a result, the MSAs seem to be highly sensitive to local optima and prone to being heavily gapped when aligning larger sequences.
Rather than relying on a single alignment for a given codon site (position n, the nth codon site), what if I were to use a sliding window, generate an alignment for each window, and then average the values obtained for a given position across all of the windows that cover it?
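To make the sliding-window idea concrete, here is a minimal Python sketch of the averaging step only. It assumes that each window has already been aligned and scored by whatever dN/dS tool is in use, and that the result for a window is a list of per-site values. The function names `window_starts` and `average_over_windows`, and the `(start, values)` input layout, are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def window_starts(n_sites, window, step):
    """Start indices of windows spanning `window` codon sites, moved by `step`."""
    return list(range(0, max(n_sites - window, 0) + 1, step))


def average_over_windows(window_estimates):
    """Average per-site estimates over every window that covers each site.

    window_estimates: iterable of (start, values) pairs, where values[i] is the
    estimate for codon site start + i taken from that window's own alignment.
    A value of None means the site could not be estimated in that window.
    Returns a dict mapping site index to the mean of its available estimates.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for start, values in window_estimates:
        for offset, value in enumerate(values):
            if value is None:
                continue  # skip sites with no estimate in this window
            totals[start + offset] += value
            counts[start + offset] += 1
    return {site: totals[site] / counts[site] for site in sorted(counts)}


if __name__ == "__main__":
    # Two overlapping toy windows; the numbers are made up for illustration.
    toy = [(0, [1.0, 2.0, 3.0]), (1, [4.0, None, 6.0])]
    print(average_over_windows(toy))
```

Sites near the ends of the genome are covered by fewer windows than sites in the middle, so their averages rest on fewer estimates; keeping the per-site counts alongside the means would make that visible.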
Just curious, how do you define dN and dS values for non-coding regions?
Just put it in a reading frame and call codons as you would for coding sequence. It is a neat trick when looking for areas that are under purifying selection to conserve a region of secondary structure. Basically, you wouldn't expect to see lower-than-average dS values in the non-coding regions: there is no selection acting on the codons (CUB, etc.), and promoter sequences (especially in viruses) are noisy. So if you see a region of non-coding sequence under purifying selection to keep a codon, some other selective force must be playing a role. In the case of RNA viruses (especially ssRNA), this could be evidence for secondary structure. If I am remembering correctly, Edward Holmes has put a fair amount of work into this.
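As a rough illustration of "put it in a reading frame and call codons", the sketch below splits an arbitrary nucleotide string into triplets for a chosen frame. The function name `codons_in_frame` is hypothetical, and the choice of frame for a non-coding region is arbitrary, which is an assumption of this example rather than something stated above.

```python
def codons_in_frame(seq, frame=0):
    """Split a nucleotide sequence into complete triplets starting at `frame`.

    frame is 0, 1 or 2; trailing bases that do not fill a triplet are dropped.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


if __name__ == "__main__":
    utr = "atggccta"  # toy sequence, not real data
    for frame in range(3):
        print(frame, codons_in_frame(utr, frame))
```

Because the frame is arbitrary in non-coding sequence, one option would be to repeat the analysis in all three frames and compare the results.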
Thanks for the heads up, that sounds like a very interesting approach to evaluating selective pressure on non-coding regions! As for your question, I think it's ultimately an alignment problem. You'll have to drop some regions if you can't get a confident alignment for them: an alignment is a very strong statement of homology, and even a small misalignment would throw all the dN and dS values off.
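One crude way to act on the "drop regions you can't align confidently" advice is to filter codon columns by how many sequences are gapped there. The sketch below does that, assuming an in-frame alignment given as equal-length strings with `-` as the gap character. The function name `confident_codon_columns` and the `max_gap_fraction` threshold are hypothetical, and gap content is only a rough proxy for alignment confidence, not a measure of it.

```python
def confident_codon_columns(aligned, max_gap_fraction=0.2):
    """Return indices of codon columns gapped in few enough sequences.

    aligned: list of equal-length, in-frame aligned sequences ('-' marks a gap).
    A sequence counts as gapped at a codon column if any of its three
    positions there is a gap. Columns where the gapped fraction exceeds
    max_gap_fraction are left out.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length // 3):
        start = 3 * col
        gapped = sum("-" in seq[start:start + 3] for seq in aligned)
        if gapped / len(aligned) <= max_gap_fraction:
            keep.append(col)
    return keep


if __name__ == "__main__":
    toy = ["ATGGCC", "ATG---", "ATGGCA"]  # toy alignment, not real data
    print(confident_codon_columns(toy))
```

With a data set dominated by one species, a simple gap fraction will mostly reflect that species, so the threshold may need to be applied per species or after down-sampling.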