I have a bunch of protein-coding DNA sequences in the correct reading frame, such that upon translation, I get the protein sequences.
If I align the corresponding amino acid sequences, each gap corresponds to a whole codon, but if I align the DNA sequences directly, the aligner can insert gaps whose lengths are not multiples of 3, causing frame shifts.
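To make the frame-shift problem concrete, here is a rough sketch of how one might flag gap runs in an aligned DNA sequence whose length is not a multiple of 3. The function name `frame_breaking_gaps` and the example sequences are hypothetical, and the check is crude: it looks only at gap length, not at whether a gap starts on a codon boundary.

```python
import re

def frame_breaking_gaps(aligned_seq):
    """Return (start, length) for each gap run whose length is not a multiple of 3."""
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(r"-+", aligned_seq)
        if len(m.group()) % 3 != 0
    ]

# Hypothetical aligned sequences, for illustration only.
alignment = {
    "seq1": "ATG-CATTTAAA",   # single-base gap: breaks the reading frame
    "seq2": "ATG---TTTAAA",   # three-base gap: frame is preserved
}

for name, seq in alignment.items():
    print(name, frame_breaking_gaps(seq))
```

Any sequence that reports a non-empty list has at least one gap that shifts the downstream codons out of frame.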
With these DNA alignments, I can build decent trees, and proceed with looking for positive selection with site models in codeml.
I then end up having sites like this:
16 * 0.99944 0.00050 0.00006 ( 1) 0.051 +- 0.028
where `*` is the codon in the first sequence that was broken during the DNA alignment.
Is the result of this analysis then wrong? Should I be aligning in amino acid space, converting back to DNA (with the correct codons in the DNA), and doing all the subsequent analyses based on that?
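A minimal sketch of the "align as protein, then convert back to DNA" step, assuming each unaligned DNA sequence is in frame and translates exactly to the gapless form of its aligned protein sequence. The function name `thread_codons` and the toy input are hypothetical; the sketch does not verify the translation itself.

```python
def thread_codons(aligned_protein, dna):
    """Map an in-frame, unaligned DNA sequence onto its aligned protein sequence.

    Each residue is replaced by its original codon and each protein gap
    by a three-base gap, so every gap in the result is a whole codon.
    """
    n_residues = len(aligned_protein.replace("-", ""))
    if len(dna) < 3 * n_residues:
        raise ValueError("DNA is too short for the number of aligned residues")
    # Take one codon per residue; anything beyond that (e.g. a trailing
    # stop codon absent from the protein) is dropped.
    codons = iter(dna[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)

# Hypothetical example: protein alignment row "M-KF" and its coding sequence.
print(thread_codons("M-KF", "ATGAAATTT"))
```

Applying this to every row of the protein alignment yields a codon alignment in which all gaps are multiples of 3 and aligned columns respect codon boundaries.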
Further, how are gaps (`-`) or gibberish codons (`*`) treated in PAML? Are they ignored?