Question: Approach To Counter Under-Sampling In Per-Site Dn/Ds Ratios
5.6 years ago by
United States
pld4.8k wrote:

I am interested in calculating dN and dS values per site across all of the sites in a genome (both coding and non-coding regions), using multiple sequences from each species in a highly genetically conserved genus of viruses. The challenge is that some species are heavily under-represented in the sequences I have available, whereas one species makes up over 50% of the available data. Although the sequences are very highly conserved overall (especially in the coding regions), there are also many regions that are more poorly conserved. Because one species is over-represented, the more divergent a species is, the more poorly it is represented in the data set. As a result, the MSAs seem to be highly sensitive to local optima and prone to heavy gapping when aligning larger sequences.

Rather than relying on a single alignment for a given codon site (the nth codon position), what if I used a sliding window to generate an alignment for each window and then averaged the values obtained for a given position across all the windows that cover it?
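To make the sliding-window idea concrete, here is a minimal sketch of the averaging step only. It assumes a hypothetical callback, `score_window(start, end)`, that aligns codon sites `start` to `end` and returns one value (for example a dN or dS estimate) per site in that window. The function name, the callback, and the parameters are all illustrative and not taken from any particular package.

```python
from collections import defaultdict

def windowed_site_average(n_sites, window, step, score_window):
    """Average per-site values over every window that covers each site.

    Sites are codon indices. score_window(start, end) is a hypothetical
    callback that returns one value per site in [start, end).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)

    last_start = max(n_sites - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the genome is covered

    for start in starts:
        end = min(start + window, n_sites)
        for offset, value in enumerate(score_window(start, end)):
            totals[start + offset] += value
            counts[start + offset] += 1

    return [totals[i] / counts[i] for i in range(n_sites)]
```

Sites near the ends are covered by fewer windows than sites in the middle, so their averages rest on fewer alignments. Keeping the window and step in whole codons keeps every window in the same reading frame.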

modified 5.6 years ago by ivanerill0 • written 5.6 years ago by pld4.8k

Just curious, how do you define dN and dS values for non-coding regions?

written 5.6 years ago by Vitis2.0k

Just put it in a reading frame and call codons as you would for coding sequence. It is a neat trick when looking for areas that are under purifying selection to conserve a region of secondary structure. Basically, you wouldn't expect to see lower-than-average dS values in the non-coding regions: there is no selection acting on the codons (CUB, etc.) and promoter sequences (especially in viruses) are noisy. So if you see a region of non-coding sequence under purifying selection to keep a codon, some other selective force must be playing a role. In the case of RNA viruses (especially ssRNA), this could be evidence for secondary structure. If I am remembering correctly, Edward Holmes has put a fair amount of work into this.
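As a rough illustration of "put it in a reading frame and call codons", the sketch below splits two aligned sequences into codons in a chosen frame and counts the codon differences that are synonymous or nonsynonymous under the standard genetic code. It is only a raw count under stated assumptions (pairwise, pre-aligned input, codons with gaps or ambiguous bases skipped), not a proper dN/dS estimator, which would also normalize by the number of synonymous and nonsynonymous sites. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq, frame=0):
    """Split a sequence into complete codons starting at the given frame."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def count_changes(seq_a, seq_b, frame=0):
    """Count synonymous and nonsynonymous codon differences.

    Assumes seq_a and seq_b are already aligned to each other. Codons
    containing gaps or ambiguous bases are skipped.
    """
    syn = nonsyn = 0
    for ca, cb in zip(codons(seq_a, frame), codons(seq_b, frame)):
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

For a non-coding region the frame is arbitrary, so it may be worth running all three values of `frame` and comparing the results.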

modified 5.6 years ago • written 5.6 years ago by pld4.8k

Thanks for the heads up; it sounds like a very interesting approach to evaluating selective pressure on non-coding regions! In terms of your question, I think it's ultimately an alignment problem. You'll have to drop some regions if you can't get a confident alignment for them. An alignment is a very strong statement of homology, so even a small misalignment would throw all the dN and dS values off.

written 5.6 years ago by Vitis2.0k
5.6 years ago by
ivanerill0 wrote:

CLUSTALW and other MSA programs should automatically compensate for the over-representation of similar sequences. If you use a sliding window (without alignment), then you are spreading your dN/dS counts by losing the reference frame. Why not just go with an alignment-free metric, like n-mer frequency?
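For reference, an n-mer frequency profile needs no alignment at all. A minimal sketch, with the function name and default `k` chosen purely for illustration:

```python
from collections import Counter

def kmer_frequencies(seq, k=3):
    """Return the relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than k
    return {kmer: n / total for kmer, n in counts.items()}
```

Profiles from different sequences, or from windows of the same genome, can then be compared with any vector distance.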

written 5.6 years ago by ivanerill0
