Hi,
I've been trying to compute Ka/Ks for genome-wide pairwise alignments of coding sequences from two strains of the same organism (i.e. the sequences are very similar).
So far I'm using a custom Python script to remove stop codons, MUSCLE to align the sequences, and then piping the alignments into KaKs Calculator. About 10% of the CDS come out with a ratio greater than 1, which I find highly unlikely and attribute to bad alignments.
Which alignment programs do you use to align coding sequences (of different lengths and with indels) for Ka/Ks analysis?
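For anyone reproducing the stop-codon step, here is a minimal sketch of what it might look like. It is not the script mentioned above: the function name is hypothetical, and it assumes the standard genetic code and CDS that start in frame. It refuses sequences whose length is not a multiple of three or that contain an internal stop, since silently cutting codons out of the middle of a sequence is one way to end up with shifted, poorly aligned codons.

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def strip_terminal_stop(cds):
    """Return the upper-cased CDS without its terminal stop codon.

    Raises ValueError if the length is not a multiple of 3 or if a stop
    codon is found anywhere other than the last position.
    """
    seq = cds.upper()
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")

    # Split the sequence into in-frame codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]

    # Drop only the last codon, and only if it is a stop.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]

    # Anything left that is a stop is an internal stop: flag it, don't hide it.
    if any(codon in STOP_CODONS for codon in codons):
        raise ValueError("internal stop codon found")

    return "".join(codons)
```

Sequences that raise an error are probably worth inspecting by hand (pseudogenes, annotation errors, frameshifts) rather than forcing them through the alignment.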
Hey, thanks a lot for the link! The basic idea here is that one of the strains was derived from the other. I'm treating the originating strain as the ancestor and comparing the derived strain against it. Hopefully that makes sense?
Hey,
Using the originator as the ancestor seems correct, but only if that sequence corresponds to the real ancestral state (for example, if the strain was already sequenced in the past). In other words, if the sequence you consider ancestral is a contemporary sequence of that strain, it has been evolving independently ever since it gave rise to the other strain.
Often only contemporary sequences are used, and the tools infer the ancestral state of the sequences being studied.
About the large fraction of sequences with a value greater than one:
You have to be careful about the effect of polymorphism. If the divergence time between your two sequences is really short, your Ka will be inflated and your result biased. Synonymous and non-synonymous mutations arise randomly, at the same underlying mutation rate, but some time is needed for selection to take effect (mutations that reduce fitness or cause serious problems are removed, and advantageous ones are retained).
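One practical way to act on this is to avoid over-interpreting ratios from gene pairs that have barely diverged. Below is a minimal sketch, assuming you already have per-gene Ka and Ks values as plain numbers. The function name and the `min_ks` cutoff are hypothetical and only illustrative; the right threshold depends on your data.

```python
def classify_ratio(ka, ks, min_ks=0.01):
    """Label a Ka/Ks estimate, setting aside pairs with very low Ks.

    min_ks is an arbitrary illustrative cutoff, not a recommended value.
    """
    # With (almost) no synonymous divergence the ratio is undefined or
    # dominated by a handful of substitutions, so do not interpret it.
    if ks <= 0 or ks < min_ks:
        return "low divergence: ratio unreliable"

    ratio = ka / ks
    return "ratio > 1" if ratio > 1 else "ratio <= 1"


# Hypothetical values, only to show the three possible labels.
for ka, ks in [(0.002, 0.001), (0.05, 0.10), (0.30, 0.10)]:
    print(ka, ks, classify_ratio(ka, ks))
```

Genes that still show a ratio above 1 after such a filter are better candidates for a closer look than the raw 10%.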