As others have said, there are many options for orthology
prediction, and most of them exist for a specific
reason. In other words, it is difficult to give an absolute answer
to the question "Which is the best one?".
Nevertheless, the following reasoning helps me decide what works
best for each of my projects:
Orthology and paralogy, as originally defined by Fitch, are both
evolutionary concepts. That is, orthologous genes are homologous
sequences that started to diverge through a speciation event
(paralogs, likewise, started to diverge through a duplication event).
Consequently, the better you can approximate the evolution of such
sequences, the better your orthology predictions will be.
In this respect, phylogenetic reconstruction is expected to provide
you with the best evolutionary view. Therefore, by analyzing the
phylogenetic trees (for example, using tree reconciliation
algorithms) it is possible to derive a collection of fine-grained
predictions of all orthology relationships among sequences.
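To make the tree-based idea concrete, here is a minimal sketch of how pairs can be read off a rooted gene tree. It uses the species-overlap rule (a node whose two subtrees share a species is treated as a duplication, otherwise as a speciation), which is a simpler stand-in for full tree reconciliation. The nested-tuple tree format, the "species|gene" leaf naming and the gene names are my own assumptions for illustration, not the format of any particular tool.

```python
from itertools import product


def leaves(node):
    """Return the (species, gene) leaves found under a node."""
    if isinstance(node, str):
        species, gene = node.split("|")
        return [(species, gene)]
    left, right = node
    return leaves(left) + leaves(right)


def orthologs_and_paralogs(tree):
    """Classify every gene pair by the node where the two genes split."""
    orthologs, paralogs = set(), set()

    def visit(node):
        if isinstance(node, str):
            return
        left, right = node
        left_leaves, right_leaves = leaves(left), leaves(right)
        left_species = {species for species, _ in left_leaves}
        right_species = {species for species, _ in right_leaves}
        # Species overlap: shared species imply a duplication node.
        is_duplication = bool(left_species & right_species)
        target = paralogs if is_duplication else orthologs
        for (_, gene1), (_, gene2) in product(left_leaves, right_leaves):
            target.add(frozenset((gene1, gene2)))
        visit(left)
        visit(right)

    visit(tree)
    return orthologs, paralogs


# Hypothetical family: a duplication before the human/mouse split,
# with a single fly gene as outgroup.
gene_tree = (
    (("human|HsA1", "mouse|MmA1"), ("human|HsA2", "mouse|MmA2")),
    "fly|DmA",
)
orthologs, paralogs = orthologs_and_paralogs(gene_tree)
for pair in sorted(sorted(p) for p in orthologs):
    print("ortholog pair:", pair)
```

In this toy tree HsA1 and MmA1 come out as orthologs, HsA1 and MmA2 as paralogs, and the fly gene is orthologous to all four mammalian genes (a one-to-many relationship).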
However, reconstructing gene phylogenies using the most modern and
accurate methods is computationally very intensive (and the resulting
trees are not free of artifacts). As a consequence, this approach
cannot be used at large scale unless you have enough
computational power. Generally speaking, if your species of interest
are available as precomputed predictions in any phylogeny-based
database, it is worth a try. Otherwise, you can move to alternative
methods based on pairwise sequence comparisons. These methods are
faster and can usually cope with larger amounts of data.
There is also a third, independent alternative that consists of
inferring the evolution of genes (and therefore their relationships)
from genomic features other than their coding sequence. For
instance, the YGOB database can be used to obtain orthology and paralogy
predictions based on gene order conservation among several
species. This approach is usually considered very reliable, and
it is sometimes used as a gold standard for benchmarks.
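As a rough illustration of how gene order can help, the sketch below counts how many neighbours of a gene in one genome have a homolog among the neighbours of a candidate gene in another genome. This is only a toy scoring scheme of my own; it is not how YGOB works internally, and the gene names, the `homologs` mapping and the window size are assumptions.

```python
def neighbours(order, gene, window):
    """Genes located within `window` positions of `gene` on a chromosome."""
    i = order.index(gene)
    return set(order[max(0, i - window):i] + order[i + 1:i + 1 + window])


def synteny_support(gene_a, gene_b, order_a, order_b, homologs, window=2):
    """Count neighbours of gene_a with a homolog next to gene_b."""
    near_a = neighbours(order_a, gene_a, window)
    near_b = neighbours(order_b, gene_b, window)
    return sum(1 for gene in near_a if homologs.get(gene, set()) & near_b)


# Hypothetical gene orders along one chromosome in each species.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5", "b7", "b8", "b9", "b10"]
# Candidate homologs of each species-A gene (e.g. from sequence similarity).
homologs = {
    "a1": {"b1"},
    "a2": {"b2"},
    "a3": {"b3", "b9"},  # two candidates: which one is the ortholog?
    "a4": {"b4"},
    "a5": {"b5"},
}

for candidate in sorted(homologs["a3"]):
    score = synteny_support("a3", candidate, order_a, order_b, homologs)
    print("a3 vs", candidate, "conserved neighbours:", score)
```

Here b3 sits in the same genomic context as a3 (four conserved neighbours), while b9 does not (none), so gene order favours b3 as the ortholog.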
Phylogeny-based analysis will be the better choice if (among other
reasons):
you are trying to predict orthology for a very intricate gene family,
including many duplications, gene losses, etc.;
you need a fine-grained distinction among many-to-many,
one-to-many and one-to-one relationships;
you need orthology and paralogy predictions among many species at
the same time;
you want to know about gene losses.
Note that phylogenetic trees are not perfect. They are not free of artifacts and, like other methods, they can lead to wrong predictions in cases of incomplete lineage sorting or horizontal gene transfer.
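The distinction among one-to-one, one-to-many and many-to-many relationships can be sketched as follows: given a list of predicted ortholog pairs between two species, each pair is labelled according to how many orthologs each of its genes has in the other species. The input format and gene names are hypothetical; real databases use their own labels for these categories.

```python
from collections import defaultdict


def relationship_types(pairs):
    """Label each (gene_a, gene_b) ortholog pair by its relationship type."""
    partners_of_a = defaultdict(set)  # gene in species A -> orthologs in B
    partners_of_b = defaultdict(set)  # gene in species B -> orthologs in A
    for gene_a, gene_b in pairs:
        partners_of_a[gene_a].add(gene_b)
        partners_of_b[gene_b].add(gene_a)

    labels = {}
    for gene_a, gene_b in pairs:
        several_in_b = len(partners_of_a[gene_a]) > 1
        several_in_a = len(partners_of_b[gene_b]) > 1
        if several_in_a and several_in_b:
            labels[(gene_a, gene_b)] = "many-to-many"
        elif several_in_a or several_in_b:
            labels[(gene_a, gene_b)] = "one-to-many"
        else:
            labels[(gene_a, gene_b)] = "one-to-one"
    return labels


# Hypothetical predictions between a human and a mouse gene set.
pairs = [
    ("hsX", "mmX"),
    ("hsY", "mmY1"), ("hsY", "mmY2"),
    ("hsZ1", "mmZ1"), ("hsZ1", "mmZ2"),
    ("hsZ2", "mmZ1"), ("hsZ2", "mmZ2"),
]
labels = relationship_types(pairs)
for pair in pairs:
    print(pair, labels[pair])
```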
BLAST-based methods are much faster and provide good
results. There are many tools that you can use to generate your own
predictions. You will need to choose among them by considering
their limitations and specific scope. For instance:
Do you need a very fast approach to find pairs of orthologs in
many species? (Best Reciprocal Hits)
Is it crucial to differentiate one-to-one orthologs from sequences
with in-paralogs? (InParanoid, COG, etc.)
Do you need cross relationships among more than two species?
Note that many of these tools also provide precomputed data.
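Best Reciprocal Hits is simple enough to sketch in a few lines. The example below assumes you have already run the two searches (species A against B, and B against A) and parsed them into (query, subject, score) tuples; the gene names and scores are made up, and ties or multiple high-scoring segments per pair are ignored for brevity.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical parsed search results: (query, subject, bit score).
a_vs_b = [
    ("a1", "b1", 200), ("a1", "b2", 150),
    ("a2", "b2", 180),
    ("a3", "b1", 90),
]
b_vs_a = [
    ("b1", "a1", 198),
    ("b2", "a2", 175), ("b2", "a1", 140),
]

brh_pairs = best_reciprocal_hits(a_vs_b, b_vs_a)
print(brh_pairs)
```

Note that a3 is dropped: its best hit is b1, but b1's best hit is a1. This is exactly the case (a3 being a possible in-paralog of a1) that BRH cannot represent and that tools accounting for in-paralogs try to handle.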
An incomplete summary of resources:
(with special focus on phylogeny-based predictions)
Phylogeny-based methods
MetaPhOrs (precomputed data): It combines predictions from many
different databases and provides a consistency score for each
orthology relationship. Useful to find highly reliable
predictions. Data can be browsed interactively or downloaded from an
EnsemblCompara (precomputed data): Phylogeny-based orthology and
paralogy predictions. Ensembl bases its predictions on the analysis
of gene family trees reconstructed using TreeBest (PhyML with a fixed
evolutionary model, DNA and protein analysis, slightly guided trees
for better tree reconciliation).
PhylomeDB (precomputed data): It bases its predictions on a
per-gene phylogenetic analysis (PhyML testing several evolutionary
models, plus alignment trimming and optimization). Note that, while
Ensembl is a general-purpose database, PhylomeDB is organized in
"phylomes", which are genome-wide collections of trees whose taxon
sampling and analysis design are usually hypothesis driven. Since the
publication of MetaPhOrs, PhylomeDB uses MetaPhOrs to measure the
reliability of its phylome-based predictions.
In general terms, both Ensembl and PhylomeDB tend to benchmark very
similarly (with good results), and they provide convenient API access
to the DB and FTP downloads.
TreeFam (precomputed data): Similar to EnsemblCompara, but it
includes a set of manually curated trees. It seems to be
discontinued; the latest release dates from Feb 2009.
PHOG, analysis of precomputed phylogenies using a slightly
BLAST-based methods
Inparanoid (precomputed data and standalone application):
Predictions between pairs of species. It accounts for one-to-many
and many-to-many relationships.
EggNOG (~COG) (precomputed data): Comprehensive catalog (630 species,
including bacteria and archaea) of functionally annotated
orthologous groups. An all-against-all BLAST comparison is used to
build the orthologous groups. It accounts for in-paralogs.
OrthoMCL, MultiParanoid: Extensions of the previous methods. They
add the possibility of generating predictions for several species at
the same time.
Best Reciprocal Hits (BRH): The simplest method. Still very useful when
only the best orthologous pairs between two species are required.
Some benchmarks (among others)
http://genomebiology.com/2007/8/6/R109 (figure 4)
http://nar.oxfordjournals.org/content/37/suppl_2/W84.full (Figure 1)
http://nar.oxfordjournals.org/content/early/2010/12/11/nar.gkq953.full (Figure 3)
Human, other primates and mouse are well represented in MetaPhOrs
(combining Ensembl and other DB predictions). Following the above
reasoning, I would give it a try.
Hope it helps!!