Removing GC dinucleotides from a sequence
3.0 years ago
spiral01 ▴ 110

I have gene alignments between humans and chimpanzees and I need to remove GC dinucleotides between humans and chimpanzees. My question is how best to proceed: should this be done at the codon level or at the sequence level?

For instance, the sequence ACTGCA can be split into the two codons ACT and GCA. I can therefore remove the second codon from both the human and chimpanzee sequences, and the alignment length remains a multiple of 3. The problem with this method is that it doesn't account for GC dinucleotides that span a codon boundary (e.g. TCGCAA, where splitting into codons gives TCG and CAA).
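A minimal sketch of the codon-level approach described above, assuming two equal-length, in-frame, upper-case aligned sequences. The function name `drop_gc_codons` and its `motif` parameter are hypothetical, introduced only for illustration. The motif defaults to "GC" as written in the question; pass "CG" if CpG sites are what is meant.

```python
def drop_gc_codons(human, chimp, motif="GC"):
    """Drop aligned codon columns where either codon contains the motif."""
    if len(human) != len(chimp) or len(human) % 3:
        raise ValueError("expected equal-length, in-frame aligned sequences")
    kept_h, kept_c = [], []
    for i in range(0, len(human), 3):
        h, c = human[i:i + 3], chimp[i:i + 3]
        # Only motifs lying entirely inside one codon are detected here.
        if motif in h or motif in c:
            continue
        kept_h.append(h)
        kept_c.append(c)
    return "".join(kept_h), "".join(kept_c)


print(drop_gc_codons("ACTGCA", "ACTGCA"))  # ('ACT', 'ACT')
print(drop_gc_codons("TCGCAA", "TCGCAA"))  # ('TCGCAA', 'TCGCAA')
```

The second call shows the limitation: the GC that spans the TCG/CAA boundary is never seen, so nothing is removed.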

The alternative is simply to remove every GC dinucleotide from the sequence, but this may reduce the sequence to a length that isn't divisible by 3 (i.e. it can no longer be split neatly into codons). For example, if we remove all GC dinucleotides from the sequence TCAGCGCAT we are left with TCAAT, which has length 5. As I am dealing with alignments between humans and chimpanzees (and will be running PAML, which requires sequence lengths to be divisible by 3), this could be problematic. This is likely quite an obvious problem but I am unsure of how to proceed. Any suggestions?
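The length problem with sequence-level removal can be reproduced in a few lines. This is only an illustrative sketch; `strip_motif` is a hypothetical helper name, and it uses plain `str.replace` on a single ungapped sequence.

```python
def strip_motif(seq, motif="GC"):
    """Delete every non-overlapping occurrence of the motif from one sequence."""
    return seq.replace(motif, "")


stripped = strip_motif("TCAGCGCAT")
print(stripped, len(stripped), len(stripped) % 3)  # TCAAT 5 2
```

Two further caveats with this approach: deleting bases can bring a G and a C together and create a new GC (for instance, `strip_motif("GGCC")` returns "GC"), and applying it to each species separately removes different positions from each, so the two sequences no longer line up column by column.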

EDIT: As per the comment below, we wish to do this because CpGs have much higher rates of mutation than other dinucleotides in humans. The problem here is that the density of CpGs differs between synonymous and non-synonymous sites. We are pooling sites to calculate rates of adaptive evolution for different amino acids.

SNP alignment sequence gene • 718 views

I need to remove GC dinucleotides between humans and chimpanzees

Why? Your entire question revolves around this need yet there is no explanation for this need.


Hi, please see the edit. Thanks.


You may be better off soft/hard masking GCs and using an alignment tool that works well with masked sequences (most of them should).
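One way the hard-masking suggestion could look in practice is sketched below. It masks whole codon columns rather than single bases, so the alignment length and reading frame are preserved and dinucleotides that span a codon boundary are also caught. The function name `mask_gc_codons` and its parameters are hypothetical. The sketch assumes equal-length, in-frame, upper-case sequences with no gap characters inside a motif, and that the downstream tool accepts `N` as an ambiguous base, which should be checked for the specific program and settings used.

```python
def mask_gc_codons(human, chimp, motif="GC", mask_char="N"):
    """Hard-mask every codon column that overlaps the motif in either sequence."""
    if len(human) != len(chimp) or len(human) % 3:
        raise ValueError("expected equal-length, in-frame aligned sequences")

    masked_codons = set()
    for seq in (human, chimp):
        start = seq.find(motif)
        while start != -1:
            # A dinucleotide can span two codons, so flag every codon it touches.
            for pos in range(start, start + len(motif)):
                masked_codons.add(pos // 3)
            start = seq.find(motif, start + 1)

    def apply_mask(seq):
        return "".join(
            mask_char * 3 if i // 3 in masked_codons else seq[i:i + 3]
            for i in range(0, len(seq), 3)
        )

    return apply_mask(human), apply_mask(chimp)


print(mask_gc_codons("TCAGCGCAT", "TCAGCGCAT"))  # ('TCANNNNNN', 'TCANNNNNN')
print(mask_gc_codons("ACTGCAAAA", "ACTACAAAA"))  # ('ACTNNNAAA', 'ACTNNNAAA')
```

For soft masking, the same codon indices could be used to lower-case the affected codons instead of replacing them. As with the question, the motif defaults to "GC"; pass `motif="CG"` if CpG sites are the intended target.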