Question: Approach To Counter Under-Sampling In Per-Site Dn/Ds Ratios
5.6 years ago, pld (4.8k, United States) wrote:

I am interested in calculating per-site dN and dS values across all sites in a genome (both coding and non-coding regions), using multiple sequences from each species in a highly conserved genus of viruses. The challenge is that some species are heavily under-represented in the sequences I have available, whereas one species makes up over 50% of the data. Although the sequences are very highly conserved overall (especially in the coding regions), there are also many regions that are more poorly conserved. Because one species is over-represented, the more divergent a species is, the more poorly it is represented in the data set. As a result, the MSAs seem highly sensitive to local optima and prone to heavy gapping when aligning longer sequences.

Rather than relying on a single alignment for a given codon site (the nth codon position), what if I used a sliding window, generated an alignment for each window, and then averaged the values obtained for a given position across the windows that cover it?
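To make the averaging step concrete, here is a minimal sketch in Python. It only shows the bookkeeping: `estimate_window` is a hypothetical callback (not part of any library) that is assumed to align the sequences in the window and return one value, for example dN or dS, per site in that window. The window and step sizes are also assumptions.

```python
from collections import defaultdict


def average_over_windows(n_sites, window, step, estimate_window):
    """Average per-site values over every window that covers each site.

    estimate_window(start, end) is a hypothetical, user-supplied function
    that returns one value per site in the half-open range [start, end).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    last_start = max(n_sites - window, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + window, n_sites)
        values = estimate_window(start, end)
        for offset, value in enumerate(values):
            totals[start + offset] += value
            counts[start + offset] += 1
    # Sites not covered by any window (possible when step does not
    # divide evenly) are reported as None rather than guessed.
    return [totals[i] / counts[i] if counts[i] else None
            for i in range(n_sites)]
```

Note that sites near the ends are covered by fewer windows than sites in the middle, so their averages rest on fewer estimates.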

codon alignment msa • 1.4k views

Just curious, how do you define dN and dS values for non-coding regions?

written 5.6 years ago by Vitis (2.0k)

Just put it in a reading frame and call codons as you would for coding sequence. It is a neat trick when looking for regions that are under purifying selection to conserve secondary structure. Basically, you wouldn't expect to see lower-than-average dS values in non-coding regions: there is no selection acting on the codons (codon usage bias, etc.), and promoter sequences (especially in viruses) are noisy. So if you see a region of non-coding sequence under purifying selection to keep a codon, some other selective force must be playing a role. In the case of RNA viruses (especially ssRNA), this could be evidence for secondary structure. If I am remembering correctly, Edward Holmes has put a fair amount of work into this.
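As a rough illustration of "put it in a reading frame and call codons", the Python sketch below imposes a frame on two already-aligned sequences and counts codons that differ synonymously versus nonsynonymously under the standard genetic code. It is only a raw count of differing codons, not a dN/dS estimate (which would also normalise by the number of synonymous and nonsynonymous sites); the function name and the choice to skip gapped or ambiguous codons are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a, seq_b, frame=0):
    """Count (synonymous, nonsynonymous) codon differences in a frame.

    Both sequences are assumed to be aligned to each other already.
    Codons containing gaps or ambiguous bases are skipped.
    """
    seq_a = seq_a.upper().replace("U", "T")
    seq_b = seq_b.upper().replace("U", "T")
    syn = nonsyn = 0
    length = min(len(seq_a), len(seq_b))
    for i in range(frame, length - 2, 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

Because a non-coding region has no true frame, it may be worth repeating the count for `frame` values 0, 1 and 2 and comparing the results.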

modified 5.6 years ago • written 5.6 years ago by pld (4.8k)

Thanks for the heads up, that sounds like a very interesting approach to evaluating selective pressure on non-coding regions! As for your question, I think it's ultimately an alignment problem. You'll have to drop regions for which you can't get a confident alignment, because an alignment is a very strong statement of homology and even a small misalignment would throw all the dN and dS values off.

written 5.6 years ago by Vitis (2.0k)
5.6 years ago, ivanerill (0) wrote:

CLUSTALW and other MSA programs should automatically compensate for the over-representation of similar sequences. If you use a sliding window (without alignment), you are spreading your dN/dS counts by losing the reference frame. Why not just go with an alignment-free metric, like n-mer frequency?
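For reference, an n-mer frequency comparison could look something like the Python sketch below. The function names, the default n-mer length of 3, and the use of Euclidean distance between frequency profiles are all assumptions chosen for illustration; other distances would work just as well.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Return the relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(seq_a, seq_b, k=3):
    """Euclidean distance between the k-mer frequency profiles."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    kmers = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(m, 0.0) - fb.get(m, 0.0)) ** 2
                         for m in kmers))
```

A metric like this compares whole sequences (or whole windows) without any alignment, so it does not give per-site values.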
