Question

Understanding codeml output for orthology cluster and genefamily cluster

11.1 years ago

RB ▴ 20

Dear All,

I have run codeml on orthology clusters for a few different gene sets from one species. A recent study analysed the same species. Our methodologies differ slightly, but my results are very different from theirs. I am a bit perplexed and tired of rechecking my data: this is the first time I have used codeml, and the thought of mistakes in the analysis is making me very nervous. Below are the differences in results and methodology. It would be a great help if someone could tell me whether I should expect this much difference:

Difference in Results -> Both codeml analyses were conducted on the same species. I analysed different gene sets and concluded that around 50-55% of gene families are under site-specific selection in this species. The other study analysed randomly selected genes from the same species and concluded that around 9% of genes in this species are under site-specific selection. Bearing in mind that different gene sets can experience different levels of selection, I also selected a random set of genes, and the result was again 50-55%. So I concluded that the difference in results is due to the difference in methodology.

Difference in Methods -> They analysed full gene families (orthologs + paralogs), while I used orthologs only. The other difference is that they pulled gene families from only 6 species, while my orthology clusters contain ~40 species. Can I assume that the difference in results is due to their use of full gene families?

As for the codeml models for site-specific selection, they used models M0-M3. If no positive selection was found using the basic M2 and M3 models, they did not proceed to the M7 and M8 models. I, on the other hand, compare M8 with M7 and M8 with M8a. Someone suggested to me a while ago that the M2 and M3 models have not performed well in the literature. Do these models yield false negatives? I could not find any research paper on this (if someone knows of one, could you send it to me?).
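For reference, the M7 vs M8 and M8a vs M8 comparisons mentioned above are usually carried out as likelihood ratio tests on the lnL values that codeml reports. The snippet below is only a minimal sketch of that step, not the procedure used in either study. It assumes the commonly used null distributions (chi-square with 2 degrees of freedom for M7 vs M8, and a 50:50 mixture of a point mass at zero and chi-square with 1 degree of freedom for M8a vs M8). The function names and the lnL values in the usage note are made up for illustration.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable (df = 1 or 2 only)."""
    if stat <= 0:
        return 1.0
    if df == 1:
        # Closed form for 1 degree of freedom.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Closed form for 2 degrees of freedom.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def lrt_m7_vs_m8(lnl_m7, lnl_m8):
    """Return (2*delta lnL, p-value), assuming a chi-square null with df = 2."""
    stat = max(0.0, 2.0 * (lnl_m8 - lnl_m7))
    return stat, chi2_sf(stat, 2)


def lrt_m8a_vs_m8(lnl_m8a, lnl_m8):
    """Return (2*delta lnL, p-value), assuming a 50:50 mixture null.

    The mixture is a point mass at 0 and a chi-square with df = 1,
    so the p-value is half the df = 1 tail probability.
    """
    stat = max(0.0, 2.0 * (lnl_m8 - lnl_m8a))
    p_value = 1.0 if stat == 0.0 else 0.5 * chi2_sf(stat, 1)
    return stat, p_value
```

With hypothetical values, `lrt_m7_vs_m8(-2345.67, -2340.12)` returns the test statistic and p-value for one cluster. When thousands of clusters are tested, the per-cluster p-values would normally also need a multiple-testing correction before counting genes as positively selected.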

Any help or suggestions would be highly appreciated.

Thanks,
Reb

codeml • 2.5k views

updated 3.7 years ago by Ram 45k • written 11.1 years ago by RB ▴ 20

Ram · Answer 1 · 2014-07-02

Hi Reb,

It looks like your results contain too many false-positive positively selected (PS) genes.

The out-group size (the number of other species used in your study) looks large to me. You may need to check the quality of each sequence. If there are errors in a sequence's source or annotation, you might want to drop that one.
The multiple sequence alignment (MSA) software is critical for detecting positive selection. PRANK, although slow, has been recommended.
Software such as Gblocks or GUIDANCE can remove unreliable regions from an MSA. You can try them if you like. Also, if you inspect your sequences by eye, especially the PS genes you detected, you may well notice errors in them.
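As a rough first pass before any by-eye inspection, the sequence-quality check suggested above could be partly automated. The snippet below is a hedged sketch only: it flags coding sequences whose length is not a multiple of 3, that contain an internal stop codon, or that contain non-ACGT characters. It assumes the standard genetic code and unaligned (or gap-stripped) nucleotide CDS input. The function names are hypothetical, and it is not a substitute for Gblocks, GUIDANCE, or a proper alignment check.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def cds_problems(seq):
    """Return a list of simple quality flags for one coding sequence."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps if present
    problems = []

    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    # A terminal stop codon is expected, so only internal stops are flagged.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    if any(codon in STOP_CODONS for codon in codons):
        problems.append("internal stop codon")

    if any(base not in "ACGT" for base in seq):
        problems.append("ambiguous or non-ACGT base")

    return problems


def flag_cluster(cluster):
    """Map sequence name to its flags, keeping only flagged sequences."""
    flagged = {}
    for name, seq in cluster.items():
        problems = cds_problems(seq)
        if problems:
            flagged[name] = problems
    return flagged
```

Calling `flag_cluster` on a dictionary of `{name: sequence}` for one orthology cluster gives a short list of sequences worth inspecting or dropping before the alignment and codeml steps.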

Maybe this paper will be useful: http://www.ncbi.nlm.nih.gov/pubmed/20333182

Hope it helps.

Shuo