Approach To Counter Under-Sampling In Per-Site Dn/Ds Ratios
10.7 years ago
pld 5.1k

I am interested in calculating dN and dS values per site across all of the sites in a genome (including both coding and non-coding regions), using multiple sequences from each species in a highly genetically conserved genus of viruses. The challenge is that some species are heavily under-represented in the sequences I have available, whereas one species makes up over 50% of the data. Although the sequences are very highly conserved overall (especially in the coding regions), there are also many poorly conserved regions. Because one species is over-represented, the more divergent a species is, the more poorly it is represented in the data set. As a result, the MSAs seem to be highly sensitive to local optima and prone to heavy gapping when aligning larger sequences.

Rather than considering a single alignment for a given codon site (the nth codon position), what if I were to use a sliding window, generate an alignment for each window, and then average the values obtained for a given position across the windows that cover it?
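To make the idea concrete, here is a minimal sketch of the averaging step only. The function name `windowed_site_average` and the callback `estimate_window` are hypothetical: the callback stands in for whatever realigns a window and returns one value per codon, which is not specified in this thread. Windows are counted in codons so that each window stays in frame.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def windowed_site_average(
    n_codons: int,
    window: int,
    step: int,
    estimate_window: Callable[[int, int], Sequence[float]],
) -> List[float]:
    """Average per-codon estimates over every window that covers each codon.

    estimate_window(start, end) is a hypothetical callback: it is assumed to
    realign codons [start, end) and return one value per codon in that range.
    """
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)

    # Window start positions, in codon units.
    starts = list(range(0, max(n_codons - window, 0) + 1, step))
    # Add a final window if the regular steps stop short of the last codon.
    if starts[-1] + window < n_codons:
        starts.append(n_codons - window)

    for start in starts:
        end = min(start + window, n_codons)
        for offset, value in enumerate(estimate_window(start, end)):
            sums[start + offset] += value
            counts[start + offset] += 1

    return [sums[i] / counts[i] for i in range(n_codons)]
```

Codons near the ends are covered by fewer windows than codons in the middle, so their averages rest on fewer alignments and are likely to be noisier.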

codon msa alignment • 2.2k views

Just curious, how do you define dN and dS values for non-coding regions?


Just put it in a reading frame and call codons as you would for a coding region. It is a neat trick when looking for areas that are under purifying selection to conserve a region of secondary structure. Basically, you wouldn't expect to see lower-than-average dS values in the non-coding regions: there is no selection acting on the codons (e.g. codon usage bias, CUB), and promoter sequences (especially in viruses) are noisy. So if you see a region of non-coding sequence under purifying selection to keep a codon, some other selective force must be playing a role. In the case of RNA viruses (especially ssRNA), this could be evidence for secondary structure. If I am remembering correctly, Edward Holmes has put a fair amount of work into this.
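As a rough illustration of "put it in a reading frame and call codons", the sketch below splits two aligned sequences into codons at a chosen frame offset and labels each codon site using the standard genetic code. The function names are made up for this example, and it only counts raw differences: a real dN/dS estimate also normalises by the number of synonymous and nonsynonymous sites and corrects for multiple substitutions, which is not done here.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq, frame=0):
    """Split a sequence into complete codons starting at the given frame offset."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def classify_codon_differences(seq_a, seq_b, frame=0):
    """Label each codon site of two aligned sequences.

    Returns one label per codon: 'same', 'syn', 'nonsyn', or 'skip'
    (the codon contains a gap or an ambiguous base in either sequence).
    """
    labels = []
    for codon_a, codon_b in zip(codons(seq_a, frame), codons(seq_b, frame)):
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            labels.append("skip")
        elif codon_a == codon_b:
            labels.append("same")
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            labels.append("syn")
        else:
            labels.append("nonsyn")
    return labels
```

For a non-coding region the frame is arbitrary, so it may be worth repeating the analysis with `frame` set to 0, 1 and 2 and checking whether the signal depends on that choice.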


Thanks for the heads up, that sounds like a very interesting approach to evaluating selective pressure on non-coding regions! As for your question, I think it's ultimately an alignment problem. You'll have to drop some regions if you can't get a confident alignment for them: an alignment is a very strong statement of homology, and even a small misalignment would throw all the dN and dS values off.
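One simple way to drop regions, sketched below, is to filter codon columns by the fraction of sequences that are gapped there. Gap fraction is only a crude proxy for alignment confidence, and the function name and the 0.2 threshold are illustrative assumptions rather than recommended settings.

```python
def keep_confident_codon_columns(alignment, max_gap_fraction=0.2):
    """Return the indices of codon columns whose gap fraction is acceptable.

    alignment: list of equal-length, in-frame aligned sequences.
    max_gap_fraction: illustrative threshold, not a recommended value.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    kept = []
    for codon_index in range(length // 3):
        start = codon_index * 3
        # A sequence counts as gapped here if any of its three positions is a gap.
        gapped = sum("-" in seq[start:start + 3] for seq in alignment)
        if gapped / len(alignment) <= max_gap_fraction:
            kept.append(codon_index)
    return kept
```

Filtering whole codons rather than single columns keeps the remaining sites in frame for the downstream dN and dS calculation.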

10.7 years ago
ivanerill • 0

CLUSTALW and other MSA programs should automatically compensate for the over-representation of similar sequences. If you use a sliding window (without alignment), then you are spreading your dN/dS counts by losing the reference frame. Why not just go with an alignment-free metric, like n-mer frequency?
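For reference, an n-mer (k-mer) frequency comparison can be sketched in a few lines. The function names, the default k of 3 and the choice of Euclidean distance are assumptions made for illustration. Note that this compares whole sequences or regions and does not give per-site dN or dS values.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k=3):
    """Relative frequency of each overlapping k-mer made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def kmer_distance(freq_a, freq_b):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(key, 0.0) - freq_b.get(key, 0.0)) ** 2 for key in keys))
```

Profiles could be computed per genome or per window and compared pairwise with `kmer_distance`, which needs no alignment at all.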
