[orthomcl] Proteins with more than one predicted ortholog
8.7 years ago

Hi everyone,

I have predicted orthologs between two fungi with the OrthoMCL algorithm, but when I look at the output table, many proteins of one fungus have more than one hit, and the same is true for the other fungus. How can I tell whether a protein has two orthologs in the other fungus, or only one?

Besides, the table gives a "normalized score" for each pair of predicted orthologs. Does anyone know what it means? I was looking for a formula or a simple explanation, but the only thing I've found is this: "Normalize ortholog and co-ortholog pairs for any two species by averaging the e-values across them, and normalize using that average" (http://www.ncbi.nlm.nih.gov/pubmed/21901743). I know it is a normalized value related to the e-value, but how? Curiously, the maximum value is 1.576, and many of the orthologs with more than one hit in the other fungus have this score too.

An02g14170   e_gw1.1.1058.1   0.241
An01g08960   e_gw1.1.1090.1   1.576
An15g05520   e_gw1.1.1090.1   1.576

The parameters that I used to find the orthologs were these:

  • evalueExponentCutoff = -5 (BLAST e-value ≤ 1e-5; recommended parameter)
  • percentMatchCutoff = 70
  • I (inflation factor) = 1.5 (recommended parameter)

Thank you so much for any help!

8.7 years ago

The score is described in the paper describing the OrthoMCL procedure (it's referenced in the article you mention). OrthoMCL is nothing more than a clustering of proteins based on sequence similarity. The advantage is scalability; the disadvantage is that you can't properly infer orthology relationships. For that you need a phylogenetic tree.


@Jean-Karim: Thank you for your answer, but the only explanation in this paper is "a normalized similarity score", and it recommends seeing the OrthoMCL Algorithm Document for the normalization function. I read this document, but I'm still not sure what these score values mean. Would it be the formula given under the topic "Find potential co-ortholog pairs": "Each CO(Ax,By) is given a pair weight: O(Ax,By) = (-log10(evalue(Ax,By)) + -log10(evalue(By,Ax))) / 2"? Furthermore, do you know which blastp parameter I can use to see only 1:1 hits? Thanks again!


The description of the algorithm is in ref 7 of the paper you cite: Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 2003;13:2178-89.

In particular, see Fig. 2.

The raw score is as you describe above: the average of the -log10 of the e-values obtained by blastp A vs B and B vs A. This provides a measure of similarity between any two sequences. Before the MCL clustering algorithm is applied, this score is normalized by dividing it by the average weight of all pairs between the two species. For example, for two genes A and B, with A from fly and B from mouse, the raw score is (-log10(evalue(A,B)) + -log10(evalue(B,A))) / 2, and the normalized score is this value divided by the average of all raw scores between fly and mouse. You don't need (or want) blastp to return only one hit; you just need to take the best one for each query sequence, which should always be the first in the list returned by blastp.
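To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. It is not OrthoMCL's own code: the function names, the input layout and the `EVALUE_FLOOR` constant (used only so that an e-value reported as 0 still has a defined logarithm) are assumptions made for illustration.

```python
import math

# Assumption: BLAST can report an e-value of 0 for very strong hits.
# log10(0) is undefined, so such values are capped at a hypothetical floor.
EVALUE_FLOOR = 1e-181


def neg_log10(evalue):
    """Return -log10 of an e-value, capping zeros at EVALUE_FLOOR."""
    return -math.log10(max(evalue, EVALUE_FLOOR))


def raw_score(evalue_ab, evalue_ba):
    """Average of the -log10 e-values of the two reciprocal blastp hits."""
    return (neg_log10(evalue_ab) + neg_log10(evalue_ba)) / 2


def normalized_scores(pairs):
    """Divide each raw score by the mean raw score of all pairs.

    pairs maps (protein_a, protein_b) -> (evalue_ab, evalue_ba), where all
    pairs come from the same two species.
    """
    raw = {pair: raw_score(*evalues) for pair, evalues in pairs.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {pair: score / mean_raw for pair, score in raw.items()}


def best_hits(blast_rows):
    """Keep the hit with the lowest e-value for each query.

    blast_rows is an iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in blast_rows:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


if __name__ == "__main__":
    # Made-up e-values, only to show the calls.
    example_pairs = {
        ("fly_A1", "mouse_B1"): (1e-10, 1e-20),
        ("fly_A2", "mouse_B2"): (1e-40, 1e-50),
    }
    for pair, score in normalized_scores(example_pairs).items():
        print(pair, round(score, 3))
```

Because every raw score is divided by the same species-pair average, a normalized score above 1 simply means the pair is more similar than the average pair between those two species.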


Thank you so much for your help, Jean-Karim. It's the first time I've read a good explanation of what the normalized score of MCL is and how I can calculate it.
