I have gene alignments between humans and chimpanzees and I need to remove GC dinucleotides from both species' sequences. My question is about the best way to proceed: should this be done at the codon level or at the sequence level?
For instance, if I have the sequence ACTGCA, this can be split into the two codons ACT and GCA. I can therefore remove the second codon from both the human and chimpanzee sequences, and the alignment length should still be fine. The problem with this method is that it doesn't account for GC dinucleotides that span codon boundaries (e.g. TCGCAA, where splitting into codons gives us TCG and CAA).
The alternative is simply to remove every GC dinucleotide from the sequence, but this may reduce the sequence to a length that isn't divisible by 3 (i.e. we cannot neatly split it into codons). For example, if we remove all GC dinucleotides from the sequence TCAGCGCAT, we are left with TCAAT, which is an incorrect length. As I am dealing with alignments between humans and chimpanzees (and will be running PAML, which requires sequence lengths to be divisible by 3), this could be problematic. This is likely quite an obvious problem, but I am unsure how to proceed. Any suggestions?
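To make the frame problem concrete, here is a minimal Python sketch of the naive sequence-level removal. The function name `remove_dinucleotide` is just a label chosen for this illustration, and it does a single left-to-right pass over the sequence.

```python
def remove_dinucleotide(seq, motif="GC"):
    """Delete every non-overlapping occurrence of motif in one left-to-right pass."""
    return seq.replace(motif, "")


seq = "TCAGCGCAT"
stripped = remove_dinucleotide(seq)

# The remainder is 2, so the stripped sequence no longer splits into codons.
print(stripped, len(stripped), len(stripped) % 3)
```

The two sequences in an alignment can also lose different numbers of bases this way, so the columns would no longer line up either.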
EDIT: As per the comment below, the reason we wish to do this is that CpGs have much higher mutation rates than other dinucleotides in humans. The problem here is that the density of CpGs differs between synonymous and non-synonymous sites. We are pooling sites to calculate rates of adaptive evolution for different amino acids.
Why? Your entire question revolves around this need yet there is no explanation for this need.
Hi, please see the edit. Thanks.
You may be better off soft/hard masking GCs and using an alignment tool that works well with masked sequences (most of them should).
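As a rough sketch of what codon-aware masking could look like in Python, assuming the two sequences are already aligned in frame and have the same length: every codon column touched by the motif in either species is hard-masked as NNN (or dropped), which also covers dinucleotides that span a codon boundary. The names `codons_hit_by_motif` and `mask_pair` are hypothetical helpers, not part of any library. The default motif is "GC" as written in the question; note that a CpG site read 5' to 3' is "CG", so pass `motif="CG"` if that is what you intend.

```python
def codons_hit_by_motif(seq, motif="GC"):
    """Return the indices of codons touched by any (possibly overlapping) motif hit."""
    hit = set()
    start = seq.find(motif)
    while start != -1:
        # A dinucleotide at a codon boundary touches two codons, so record both.
        for pos in range(start, start + len(motif)):
            hit.add(pos // 3)
        start = seq.find(motif, start + 1)
    return hit


def mask_pair(human, chimp, motif="GC", drop=False):
    """Mask (or drop) every codon column hit by motif in either aligned sequence."""
    if len(human) != len(chimp) or len(human) % 3 != 0:
        raise ValueError("expected two in-frame sequences of equal length")
    hit = codons_hit_by_motif(human.upper(), motif) | codons_hit_by_motif(chimp.upper(), motif)

    def apply(seq):
        out = []
        for i in range(0, len(seq), 3):
            if i // 3 in hit:
                if not drop:
                    out.append("NNN")  # hard mask; the length stays a multiple of 3
            else:
                out.append(seq[i:i + 3])
        return "".join(out)

    return apply(human), apply(chimp)


print(mask_pair("ACTGCA", "ACTACA"))             # ('ACTNNN', 'ACTNNN')
print(mask_pair("TCGCAA", "TCGCAA"))             # ('NNNNNN', 'NNNNNN')
print(mask_pair("ACTGCA", "ACTACA", drop=True))  # ('ACT', 'ACT')
```

Because whole codon columns are masked or removed in both sequences at once, the alignment stays in frame and both sequences keep the same length. One limitation of this sketch: a motif interrupted by an alignment gap (e.g. G-C) is not detected.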