Question: What Is The Best Method To Find Orthologous Genes Of A Species?
9.8 years ago by
Ct586560 wrote:

OP took these paragraphs from Wikipedia:

"Several specialized biological databases provide tools to identify and analyze orthologous gene sequences. These resources employ approaches that can be generally classified into those that are based on all pairwise sequence comparisons (heuristic) and those that use phylogenetic methods. Sequence comparison methods were first pioneered by COGs, now extended and automatically enhanced by the eggNOG database. InParanoid focuses on pairwise ortholog relationships. OrthoDB appreciates that the orthology concept is relative to different speciation points by providing a hierarchy of orthologs along the species tree. Other databases that provide eukaryotic orthologs include OrthoMCL, OrthoMaM for mammals, OrthologID and GreenPhylDB for plants.

Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by comparing gene trees with species trees, as implemented in resources such as TreeFam and LOFT. A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct clusters and determine trees, for example Ortholuge, EnsemblCompara GeneTrees and HomoloGene."

I want to know which tool is best for finding orthologs of all human genes in other primates and in mouse.

Thank You!

orthologues • 79k views
ADD COMMENTlink modified 6.3 years ago by Prakki Rama2.4k • written 9.8 years ago by Ct586560

Perhaps it's silly to comment on it so long after the fact, but the first two paragraphs are reproduced verbatim from the Wikipedia article on homology. The question should be edited to show them as a quote, and indicate the source of the quote. I unfortunately don't have enough rep to propose edits to questions.

ADD REPLYlink written 7.1 years ago by Superbest120

Also see discussion here.

ADD REPLYlink modified 13 months ago by _r_am32k • written 9.4 years ago by Khader Shameer18k
9.8 years ago by
jhc2.9k wrote:

As it has been said, there are many options for orthology prediction. The truth is that most of them exist because of a specific reason. In other words, it is difficult to provide an absolute answer to the question "Which is the best one?".

Nevertheless, the following reflection helps me decide what works best for each of my projects:

Orthology and paralogy, as originally defined by Fitch, are both evolutionary concepts. That is, orthologous genes are homologous sequences that started to diverge through a speciation event (likewise, paralogs started to diverge through a duplication event). Consequently, the better you can approximate the evolution of such sequences, the better your orthology predictions will be.

In this respect, phylogenetic reconstruction is expected to provide you with the best evolutionary view. By analyzing the phylogenetic trees (i.e. using tree reconciliation algorithms), it is possible to derive a collection of fine-grained predictions of all orthology relationships among sequences.
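To make the tree-based idea concrete, here is a minimal sketch of the species-overlap rule, a simpler stand-in for full tree reconciliation (it is not the method of any particular database). A node is labelled a duplication if its two subtrees share at least one species, and a speciation otherwise; genes on opposite sides of a speciation node are called orthologs. The nested-tuple tree format, the `Species_gene` naming convention and the toy gene names are all assumptions made for illustration.

```python
from itertools import product


def leaves(tree):
    """Return the leaf labels under a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(gene):
    """Hypothetical naming convention: 'Species_geneName'."""
    return gene.split("_", 1)[0]


def classify_pairs(tree, orthologs=None, paralogs=None):
    """Label every gene pair by the event at its last common ancestor node."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(tree, str):
        return orthologs, paralogs
    left, right = tree
    left_genes, right_genes = leaves(left), leaves(right)
    shared = {species_of(g) for g in left_genes} & {species_of(g) for g in right_genes}
    # Shared species on both sides -> duplication node -> paralogs.
    target = paralogs if shared else orthologs
    for a, b in product(left_genes, right_genes):
        target.add(frozenset((a, b)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


# Toy tree: a duplication (A1/A2) predating the human-mouse split,
# with a single-copy zebrafish outgroup.
gene_tree = ((("Hsa_A1", "Mmu_A1"), ("Hsa_A2", "Mmu_A2")), "Dre_A")
orthologs, paralogs = classify_pairs(gene_tree)
```

In this toy case `Hsa_A1`/`Mmu_A1` come out as orthologs, `Hsa_A1`/`Mmu_A2` as paralogs, and the zebrafish gene as a co-ortholog of all four mammalian genes, which is the kind of fine-grained one-to-many distinction that pairwise methods struggle to express.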

However, reconstructing gene phylogenies using the most modern and accurate methods is computationally very intensive (and the results are not free of artifacts). As a consequence, this approach cannot be used at large scale unless you have enough computational power. Generally speaking, if your species of interest are available as precomputed predictions in any phylogeny-based database, it is worth trying them. Otherwise, you can move to alternative methods based on pairwise sequence comparisons. These methods are faster and can usually cope with larger amounts of data.

There is also a third, independent alternative that consists of inferring the evolution of genes (and therefore their relationships) from genomic features other than their coding sequence. For instance, the YGOB database can be used to obtain orthology and paralogy predictions based on gene order conservation among several species. This approach is usually considered very reliable, and it is sometimes used as a gold-standard set for benchmarks.

Phylogeny-based analysis will be the better choice if (among other reasons):

  • you are trying to predict orthology for a very intricate gene family, including many duplications, gene losses, etc.

  • you need a fine-grained distinction among many-to-many, one-to-many and one-to-one relationships.

  • you need orthology and paralogy predictions among many species at the same time.

  • you want to know about gene losses.

-Note that phylogenetic trees are not perfect. They are not free of artifacts and they can lead (as other methods) to wrong predictions in the case of lineage sorting or horizontal gene transfer.-

Blast-based methods are much faster and provide good results. There are many tools that you can use to generate your own predictions. You will need to decide among them by considering their limitations and specific scope. For instance,

  • Do you need a very fast approach to find pairs of orthologs in many species? (Best Reciprocal Hits)

  • Is it crucial to differentiate one-to-one orthologs from sequences with in-paralogs? (InParanoid, COG, etc.)

  • Do you need cross relationships among more than two species? (MultiParanoid, OrthoMCL)

-Note that many of these tools also provide precomputed data.-

An incomplete summary of resources:

(with special focus on phylogenetic based predictions)

Phylogeny based methods

  • MetaPhOrs (precomputed data): It combines predictions from many different databases and provides a consistency score for each orthology relationship. Useful to find highly reliable predictions. Data can be browsed interactively or downloaded from an FTP server.

  • EnsemblCompara (precomputed data): Phylogeny-based orthology and paralogy predictions. Ensembl bases its predictions on the analysis of gene family trees reconstructed using TreeBest (PhyML with fixed evolutionary model, DNA and protein analysis, slightly guided trees for better tree reconciliation).

  • PhylomeDB (precomputed data): It bases its predictions on a per-gene phylogenetic analysis (PhyML testing several evolutionary models, plus alignment trimming and optimization). Note that, while Ensembl is a general-purpose database, PhylomeDB is organized in "phylomes", which are genome-wide collections of trees whose taxon sampling and analysis design are usually hypothesis driven. According to the MetaPhOrs publication, PhylomeDB uses MetaPhOrs to measure the reliability of its phylome-based predictions.

    In general terms, both Ensembl and PhylomeDB tend to benchmark very similarly (with good results), and they provide convenient API access to the DB and FTP downloads.

  • TreeFam (precomputed data): Similar to EnsemblCompara, but it includes a set of manually curated trees. It seems to be discontinued; the latest release dates from Feb 2009.

  • PHOG, analysis of precomputed phylogenies using a slightly different method.

Blast-based approaches

  • Inparanoid (precomputed data and standalone application): Predictions between pairs of species. It accounts for one-to-many and many-to-many relationships.

  • EggNOG (~COG) (precomputed data): Comprehensive catalog (630 species, including bacteria and archaea) of functionally annotated orthologous groups. An all-against-all BLAST comparison is used to build the orthologous groups. It accounts for in-paralogs.

  • OrthoMCL, MultiParanoid: Extensions of the previous methods. They add the possibility of generating predictions for several species at the same time.

  • Best Reciprocal Hits (BRH): The simplest method. Still very useful when only the best orthologous pairs between two species are required.
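Since Best Reciprocal Hits is the simplest of the approaches listed above, it is easy to sketch. The following is a minimal illustration, not a replacement for any of the tools named: it assumes you have already run BLAST in both directions and that the files use the default 12-column tabular layout (`-outfmt 6`, bit score in the last column). The toy hit lists and gene names are made up.

```python
def parse_blast_tab(lines):
    """Yield (query, subject, bitscore) from default BLAST tabular output.

    Assumes the default -outfmt 6 column order, where column 12 is the bit score.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[0], cols[1], float(cols[11])


def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hits as (query, subject, bitscore); all names are hypothetical.
human_vs_mouse = [
    ("hsA", "mmA", 500.0),
    ("hsA", "mmB", 200.0),
    ("hsB", "mmB", 300.0),
    ("hsC", "mmB", 250.0),
]
mouse_vs_human = [
    ("mmA", "hsA", 495.0),
    ("mmB", "hsB", 310.0),
    ("mmB", "hsC", 240.0),
]
brh_pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

Note how `hsC` is silently dropped: its best hit `mmB` prefers `hsB`. That is exactly the limitation mentioned above, since BRH returns only the best pairs and says nothing about in-paralogs.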

Some benchmarks (among others) (figure 4) (Figure 1) (Figure 3)

Short answer

Human, other primates and mouse are well represented in MetaPhOrs (combining Ensembl and other DB predictions). Following the above reasoning, I would give it a try.

Hope it helps!!

ADD COMMENTlink modified 9.8 years ago • written 9.8 years ago by jhc2.9k

+1. Very comprehensive and pretty precise answer.

ADD REPLYlink written 9.8 years ago by lh332k

Thank you! Very useful. I deeply appreciate your answer.

ADD REPLYlink written 9.8 years ago by Ct586560

One comment (I just came across this Q&A after a similar question was asked recently) is that phylogenetic reconstruction works best for identifying deep paralogy where paralogs are clearly separated into unique sub-trees. But there are issues of in-paralogs and co-orthologs as well, where things get a lot trickier.

Doing a de novo method like OrthoMCL can really help here when you have lots of lineage specific duplications, meaning orthologs and paralogs are interleaved throughout the tree.

ADD REPLYlink written 9.4 years ago by DG7.2k
9.8 years ago by
United States
lh332k wrote:

yhc gave a wonderful answer. I wanted to comment on his/hers, but hit the word limit.

1) Definition of orthologs. Fitch's definition is the most widely accepted. IMO, it is also more precise and evolutionarily meaningful than the several alternatives. If you want to find orthologs, go for databases using such a definition (e.g. Ensembl, TreeFam and InParanoid).

2) In general, I prefer tree-based methods, especially for mammals and perhaps also vertebrates. With a tree you can visually tell if the inference makes sense, which is a huge advantage. Another advantage of tree-based methods over pairwise methods is that tree-based methods produce consistent results across species. For example, say A is a 1:1 ortholog of B and B is a 1:1 ortholog of C. In principle, A is a 1:1 ortholog to C (not true if not 1:1), but a pairwise method cannot always guarantee this.

3) However, tree-based methods are not necessarily better than other methods. Reconstructing trees is very difficult. It is quite possible to come up with a purely heuristic method to achieve better results.

4) For tree-based methods, it is important to build gene trees considering the species tree, or to try to fix the tree topology with the species tree as a sort of prior. Blindly building a gene tree (even using the best algorithm) and then doing the standard reconciliation will give very bad inference.

5) Tree-based methods do not work well for bacteria due to the lack of a good species tree and LGT/HGT. LGT very rarely, if ever, happens in mammals.

6) For mammals, nucleotide trees tend to reflect the true evolution better than protein trees. A paper argues that a protein-guided nucleotide alignment is the best for building trees. This is also my experience. Ensembl/TreeFam are using that.

7) For primates and rodents, EnsemblCompara is probably the best choice. It may not be the most accurate, but should be good enough for most purposes. I usually do not like to take the results by combining predictions. It is good for method comparison, but leads to various artifacts that are hard to understand.
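The consistency issue from point 2 can be checked directly on pairwise output. As a minimal sketch, assuming you have 1:1 orthology calls from a pairwise method for several species pairs, the function below lists triples where A-B and B-C are called but A-C is missing. The gene names and the input format (a set of unordered pairs) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def inconsistent_triples(pairs):
    """Return (a, b, c) where a-b and b-c are called 1:1 but a-c is not.

    `pairs` is a set of frozensets, each holding two gene names.
    """
    neighbours = defaultdict(set)
    for pair in pairs:
        x, y = sorted(pair)
        neighbours[x].add(y)
        neighbours[y].add(x)
    problems = []
    for middle, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if frozenset((a, c)) not in pairs:
                problems.append((a, middle, c))
    return sorted(problems)


# Toy calls: human-mouse and mouse-rat are called, human-rat is missing.
calls = {frozenset(("hsG", "mmG")), frozenset(("mmG", "rnG"))}
broken = inconsistent_triples(calls)

# Adding the missing call makes the set transitive.
fixed = inconsistent_triples(calls | {frozenset(("hsG", "rnG"))})
```

A long list of such triples is a hint that the pairwise calls disagree with any single tree, and those families are the ones worth inspecting by eye.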

ADD COMMENTlink written 9.8 years ago by lh332k

For 1:1 orthologs, transitivity always holds.

ADD REPLYlink written 9.8 years ago by lh332k

My pre-work used Ensembl data, I will use EnsemblCompara to do it first. Thank you! I learned a lot through your words.

BTW, it is jhc, not yhc, whom you commented on.

ADD REPLYlink written 9.8 years ago by Ct586560

'For example, say A is a 1:1 ortholog of B and B is a 1:1 ortholog of C. In principle, A is a 1:1 ortholog to C (not true if not 1:1)'. I would be very careful with that, as orthology is a non-transitive relation! However, it is true that for one-to-one orthologs transitivity often holds.

ADD REPLYlink written 9.8 years ago by Leszek4.1k
9.8 years ago by
Michi970 wrote:

I doubt that ANYBODY knows which is the best. But if you want to do it systematically, you may try the web API of EnsemblCompara, which you mentioned.

Take a look at the tutorial and read the chapter Homologies and Protein clusters
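For readers who prefer a scriptable route, here is a hedged sketch that queries Ensembl's public REST service rather than the API covered by the tutorial above. Everything about the endpoint is an assumption to verify against the current Ensembl REST documentation: the `homology/symbol/:species/:symbol` path, the `type`, `format` and `target_species` parameters, and the `data` / `homologies` / `type` / `species` / `id` field names are written as I recall them. The sample payload is hand-written with placeholder IDs, not real output.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"  # assumed public server address


def homology_url(symbol, species="human", target_species=None):
    """Build a homology query URL (endpoint layout is an assumption)."""
    params = {"type": "orthologues", "format": "condensed"}
    if target_species:
        params["target_species"] = target_species
    path = f"/homology/symbol/{quote(species)}/{quote(symbol)}"
    return f"{REST_SERVER}{path}?{urlencode(params)}"


def extract_orthologs(payload):
    """Pull (type, species, gene id) out of a decoded JSON response."""
    found = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            found.append((hom.get("type"), hom.get("species"), hom.get("id")))
    return found


def fetch_orthologs(symbol, species="human", target_species=None):
    """Perform the network request; not called in this sketch."""
    request = Request(
        homology_url(symbol, species, target_species),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return extract_orthologs(json.load(response))


# Hand-written stand-in for a response, with placeholder identifiers.
sample_payload = {
    "data": [
        {
            "id": "HUMAN_GENE_ID_PLACEHOLDER",
            "homologies": [
                {
                    "type": "ortholog_one2one",
                    "species": "mus_musculus",
                    "id": "MOUSE_GENE_ID_PLACEHOLDER",
                }
            ],
        }
    ]
}
parsed = extract_orthologs(sample_payload)
url = homology_url("BRCA2", target_species="mouse")
```

Looping `fetch_orthologs` over all human gene symbols would work in principle, but for genome-wide sets a bulk download (BioMart or the FTP dumps) is kinder to the server.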

ADD COMMENTlink written 9.8 years ago by Michi970

Thank you! I will try this!

ADD REPLYlink written 9.8 years ago by Ct586560
9.8 years ago by
Athens, GA, USA
Casey Bergman18k wrote:

Another option is Phylofacts orthology groups (PHOG). This resource provides a partial answer to your question, since they have performed accuracy assessment analyses using a sample of 100 Treefam proteins as a reference in their 2009 NAR paper:

Results on a benchmark dataset from the TreeFam-A manually curated orthology database show that PHOG provides a combination of high recall and precision competitive with both InParanoid and OrthoMCL, and allows users to target different taxonomic distances and precision levels through the use of tree-distance thresholds. For instance, OrthoMCL-DB achieved 76% recall and 66% precision on this dataset; at a slightly higher precision (68%) PHOG achieves 10% higher recall (86%). InParanoid achieved 87% recall at 24% precision on this dataset, while a PHOG variant designed for high recall achieves 88% recall at 61% precision, increasing precision by 37% over InParanoid.

The authors provide precision-recall curves for different methods, e.g...

[Figure: precision-recall curves for the compared methods]

...and also provide a more detailed description of their assessment here.
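If you want to run the same kind of assessment on your own predictions, precision and recall against a trusted reference set are straightforward to compute. This is a minimal sketch, not the PHOG authors' evaluation code; the pair sets below are made up, and pairs are stored as unordered `frozenset`s so that (A, B) and (B, A) count as the same call.

```python
def as_pairs(tuples):
    """Normalise (a, b) tuples into unordered pairs."""
    return {frozenset(pair) for pair in tuples}


def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted pairs against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical curated reference and method output.
reference = as_pairs([("hs1", "mm1"), ("hs2", "mm2"), ("hs3", "mm3"), ("hs4", "mm4")])
predicted = as_pairs([("mm1", "hs1"), ("hs2", "mm2"), ("hs3", "mm9")])

precision, recall = precision_recall(predicted, reference)
```

Here two of three predictions are correct (precision 2/3) and two of four reference pairs are recovered (recall 1/2). Keep in mind that the numbers are only as good as the reference set, and that a reference limited to curated families may not reflect genome-wide performance.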


See also Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes Chen F, Mackey AJ, Vermunt JK, Roos DS. PLoS One. 2007 Apr 18;2(4):e383.

ADD COMMENTlink modified 14 months ago by _r_am32k • written 9.8 years ago by Casey Bergman18k

Thank you! Great paper!

ADD REPLYlink written 9.8 years ago by Ct586560

Wow, I have never seen this paper and graph. Good to know that somebody has done such a comparison.

ADD REPLYlink written 9.4 years ago by Sudeep1.6k
9.8 years ago by
San Francisco
Treylathe950 wrote:

Agreed with the above, not sure it would be easy to assess which is best, but the ones you named are good:

You might want to try out MetaPhOrs.

I did a blog tip (video) on it a couple weeks ago.

ADD COMMENTlink written 9.8 years ago by Treylathe950

Thank you ! Great blog by the way!

ADD REPLYlink written 9.8 years ago by Ct586560
9.8 years ago by
Barcelona, Spain.
Philippe1.9k wrote:


I also think there is no best method. I guess you should take into account what you really need for your project. Depending on the analyses you might want to perform later and the biological insights behind them, some methods may be more suitable than others. You also said you wanted to retrieve human orthologs. In which species? Mammalian species only, or more distant ones? Depending on the answers to these questions, you could focus on a shorter list of databases using some specific methods.

If you really hesitate between different resources, one good solution might be to build several gene sets based on different databases. You can then assess the differences between them (which might not be that large) and get a more precise view of your data. Many databases make it easy to download or access preprocessed data. Building different sets of orthologous groups shouldn't take you too much time.
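Comparing gene sets from several databases can be as simple as set arithmetic once each resource has been reduced to (human gene, mouse gene) pairs. Below is a minimal sketch under that assumption; the source names `db_a`, `db_b`, `db_c` and all gene identifiers are placeholders, and mapping every database to a common identifier scheme (usually the hard part) is not shown.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of ortholog pairs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pair_support(sources):
    """Count how many sources report each ortholog pair."""
    counts = Counter()
    for pairs in sources.values():
        counts.update(set(pairs))
    return counts


# Hypothetical pair sets from three resources.
sources = {
    "db_a": {("hs1", "mm1"), ("hs2", "mm2"), ("hs3", "mm3")},
    "db_b": {("hs1", "mm1"), ("hs2", "mm2")},
    "db_c": {("hs1", "mm1"), ("hs3", "mm9")},
}

overlap = {
    (x, y): jaccard(sources[x], sources[y])
    for x, y in combinations(sorted(sources), 2)
}
support = pair_support(sources)
consensus = sorted(pair for pair, n in support.items() if n >= 2)
```

Pairs reported by only one source (here `hs3` mapped to `mm3` by one database and to `mm9` by another) are the ones worth a closer look, for example by inspecting the gene tree.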

ADD COMMENTlink written 9.8 years ago by Philippe1.9k

Thank you! This helps a lot. I have also had the idea to compare different methods. You help me put my foot down.

ADD REPLYlink written 9.8 years ago by Ct586560
9.8 years ago by
Manchester/UK/Cancer Biomarker Centre at CRUK-MI
Alastair Kerr5.3k wrote:

One more for your list: OMA browser

The question 'which data-source is best' is rather a moot point. More important is what evidence can be gathered to determine the likelihood of a specific ortholog prediction. e.g. is there synteny evidence? Does the gene match the species tree? Is the data only a bi-directional best hit?

The easiest predictions are when there has been no gene duplication event since the last common ancestor (LCA) of the genes you are examining. GeneTreeView in Ensembl Compara should help to determine this. The hardest cases are highly paralogous families that are known to undergo recombination / concerted evolution, e.g. GPCRs / protein kinases. For the latter, look for specialised databases which focus on them, particularly focusing on their active sites (otherwise you can focus on a protein domain from the LCA that has a different evolutionary history than the active site).

ADD COMMENTlink written 9.8 years ago by Alastair Kerr5.3k

Thank you for your analysis. I learned a lot and will rethink it.

ADD REPLYlink written 9.8 years ago by Ct586560
9.4 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

The assumption is that orthologous genes have identical or highly related functions, and that this sharing is greater than for paralogs. But Nehrt, Hahn et al. challenge this by offering that "the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act."

They combined experimentally derived function with gene expression data on nearly 9000 proteins.

This is certainly a controversial statement, but in a thought-provoking manner. After all, it is the integration of diverse data that are driving a lot of genomics. One example is GWAS (genome-wide association studies) + gene expression = better identification of likely causal variant. The same might be applied to the ortholog/paralog definition.

ADD COMMENTlink written 9.4 years ago by Larry_Parnell16k
9.4 years ago by
Anuraj Nayarisseri750 wrote:

Here is an online tool; try it out.

ADD COMMENTlink written 9.4 years ago by Anuraj Nayarisseri750
6.3 years ago by
Prakki Rama2.4k wrote:

Adding one more to the list, a BLAST based approach - Proteinortho for detecting orthology. Authors conclude that the tool requires much less time and memory compared to the earlier tools.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Prakki Rama2.4k

Hi Prakki Rama, I would like to know how to get 1:1 orthologs using Proteinortho, because many of my orthologs have multiple transcripts, as shown below.

# Species   Genes   Alg.-Conn.    
1   3   0.14    Hufu_TR37005|c0_g1_i1|m.38330 Hegu_TRINITY_DN50153_c0_g4_i5|m.67659,Hegu_TRINITY_DN50153_c0_g4_i13|m.67680,Hegu_TRINITY_DN50153_c0_g4_i10|m.67675

1   2   0.248   Hufu_TR100646|c0_g1_i1|m.124281 Hegu_TRINITY_DN37308_c0_g3_i1|m.25758,Hegu_TRINITY_DN37308_c0_g1_i1|m.25757

I have no idea about getting the 1:1 orthologs. Did you have some scripts or some suggestion to do that? I'll really appreciate your help. Thank you.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by haoyan14ioz0

Yes. I faced the same situation some time ago. Then I came across the reciprocal smallest distance tool, which assigned the best possible ortholog to each query sequence. But still, the same target can be assigned to multiple query proteins.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Prakki Rama2.4k

Thank you very much. So would it be better to first pick a single transcript to represent each gene?

ADD REPLYlink written 4.4 years ago by haoyan14ioz0

Orthology/Paralogy is defined at the gene level, not the transcript level. That said you can still have valid 1:many or many:1 relationships. Just be careful about whether you are referring to many genes versus many transcripts.
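The gene-versus-transcript distinction can be checked mechanically on the two Proteinortho rows quoted above. This is a minimal sketch that assumes Trinity-style identifiers in which the isoform is marked by an `_i<number>` suffix after the `_g<number>` gene label, and that the species columns start at the fourth field with `*` marking an absent species; check both assumptions against your own files.

```python
import re

# Everything up to and including '_g<digits>' is treated as the gene label.
ISOFORM_PATTERN = re.compile(r"^(.*?_g\d+)_i\d+")


def gene_id(transcript_id):
    """Strip the isoform and ORF suffix from a Trinity-style identifier."""
    match = ISOFORM_PATTERN.match(transcript_id)
    return match.group(1) if match else transcript_id


def one_to_one_rows(lines):
    """Return rows that are 1:1 once transcripts are collapsed to genes."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        species_columns = line.split()[3:]
        genes = [
            {gene_id(t) for t in column.split(",")}
            for column in species_columns
            if column != "*"
        ]
        if len(genes) >= 2 and all(len(g) == 1 for g in genes):
            kept.append(tuple(next(iter(g)) for g in genes))
    return kept


rows = [
    "# Species   Genes   Alg.-Conn.",
    "1   3   0.14    Hufu_TR37005|c0_g1_i1|m.38330 "
    "Hegu_TRINITY_DN50153_c0_g4_i5|m.67659,"
    "Hegu_TRINITY_DN50153_c0_g4_i13|m.67680,"
    "Hegu_TRINITY_DN50153_c0_g4_i10|m.67675",
    "1   2   0.248   Hufu_TR100646|c0_g1_i1|m.124281 "
    "Hegu_TRINITY_DN37308_c0_g3_i1|m.25758,"
    "Hegu_TRINITY_DN37308_c0_g1_i1|m.25757",
]
one_to_one = one_to_one_rows(rows)
```

Under these assumptions the first row collapses to a 1:1 relationship (three isoforms of the same `g4` gene), while the second stays 1:2 because `g3` and `g1` are distinct Trinity genes.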

ADD REPLYlink written 4.4 years ago by DG7.2k

Hi Dan Gaston, thanks a lot. I have finished the de novo assembly, used CD-HIT-EST to remove redundant transcripts, and then used TransDecoder to predict ORFs before predicting orthologs. As you said, there were still cases where one gene included multiple transcripts. Maybe that's another reason why there were several 1:many relationships.

After predicting ORFs with TransDecoder, many genes still had more than one transcript, as shown below.

 >Zhze_TR100572|c0_g1_i1|m.127365 TR100572|c0_g1_i1|g.127365  ORF TR100572|c0_g1_i1|g.127365 TR100572|c0_g1_i1|m.127365 type:5prime_partial len:354 (-) TR100572|c0_g1_i1:322-1383(-)
>Zhze_TR100572|c1_g1_i1|m.127368 TR100572|c1_g1_i1|g.127368  ORF TR100572|c1_g1_i1|g.127368 TR100572|c1_g1_i1|m.127368 type:complete len:184 (+) TR100572|c1_g1_i1:287-838(+)

I would like to get the best or longest transcript to represent each gene for the subsequent ortholog prediction, but I have no idea how. Would you have any suggestions or scripts for this problem? I'll really appreciate your help. Thank you.

ADD REPLYlink written 4.4 years ago by haoyan14ioz0

Yes, you will often have the case of a single gene having multiple transcripts. And you're right, selecting one to be representative in the ortholog search is the way to go if you want to use the protein sequence for the search. Selecting the longest is a pretty straightforward task. If you have a FASTA file you could write a small script to organise transcripts by gene and then select the longest sequence as representative, and store all representative transcripts in a separate file.
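The small script described above could look something like this. It is a sketch under stated assumptions: headers follow the Trinity/TransDecoder style seen earlier in the thread, the gene is taken to be everything up to `_g<number>` (so `c0_g1` and `c1_g1` count as different genes), and "longest" means the longest sequence in the file. The demo records use placeholder names and sequences.

```python
import re

ISOFORM_PATTERN = re.compile(r"^(.*?_g\d+)_i\d+")


def gene_id(transcript_id):
    """Strip the isoform and ORF suffix from a Trinity-style identifier."""
    match = ISOFORM_PATTERN.match(transcript_id)
    return match.group(1) if match else transcript_id


def read_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Keep the longest sequence for each gene; first one wins on ties."""
    best = {}
    for header, sequence in records:
        gene = gene_id(header.split()[0])
        if gene not in best or len(sequence) > len(best[gene][1]):
            best[gene] = (header, sequence)
    return best


# Placeholder records: two isoforms of one gene, one isoform of another.
demo = """\
>Demo_TR1|c0_g1_i1|m.1 type:complete
MKTAY
>Demo_TR1|c0_g1_i2|m.2 type:complete
MKTAYIAK
>Demo_TR1|c0_g2_i1|m.3 type:complete
MSTNP
""".splitlines()

representatives = longest_per_gene(read_fasta(demo))
fasta_out = "".join(f">{h}\n{s}\n" for h, s in representatives.values())
```

To use it on a real file, pass an open file handle to `read_fasta` and write `fasta_out` to a new file. If you prefer "best" over "longest", one option is to restrict the records to those whose header says `type:complete` before picking the longest.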

ADD REPLYlink written 4.4 years ago by DG7.2k