What Is The Best Method To Find Orthologous Genes Of A Species?
10
70
12.0 years ago
Ct586 ▴ 610

OP took these paragraphs from Wikipedia:

"Several specialized biological databases provide tools to identify and analyze orthologous gene sequences. These resources employ approaches that can be generally classified into those that are based on all pairwise sequence comparisons (heuristic) and those that use phylogenetic methods. Sequence comparison methods were first pioneered by COGs, now extended and automatically enhanced by the eggNOG database. InParanoid focuses on pairwise ortholog relationships. OrthoDB appreciates that the orthology concept is relative to different speciation points by providing a hierarchy of orthologs along the species tree. Other databases that provide eukaryotic orthologs include OrthoMCL, OrthoMaM for mammals, OrthologID and GreenPhylDB for plants.

Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by comparing gene trees with species trees, as implemented in resources such as TreeFam and LOFT. A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct clusters and determine trees, for example Ortholuge, EnsemblCompara GeneTrees and HomoloGene."

I want to know which tool is best for finding the orthologs of all human genes in other primates and in mouse.

Thank You!

orthologues • 88k views
1

Perhaps it's silly to comment on it so long after the fact, but the first two paragraphs are reproduced verbatim from the Wikipedia article on homology. The question should be edited to show them as a quote, and indicate the source of the quote. I unfortunately don't have enough rep to propose edits to questions.

0

Also see discussion here.

86
12.0 years ago
jhc ★ 3.0k

As has already been said, there are many options for orthology prediction, and most of them exist for a specific reason. In other words, it is difficult to give an absolute answer to the question "Which is the best one?".

Nevertheless, the following reasoning helps me decide what works best for each of my projects:

Orthology and paralogy, as originally defined by Fitch, are both evolutionary concepts. That is, orthologous genes are homologous sequences that started to diverge through a speciation event (likewise, paralogs started to diverge through a duplication event). Consequently, the better you can approximate the evolution of such sequences, the better your orthology predictions will be.

In this respect, phylogenetic reconstruction is expected to provide you with the best evolutionary view. Therefore, by analyzing the phylogenetic trees (i.e. using tree reconciliation algorithms) it is possible to derive a collection of fine-grained predictions of all orthology relationships among sequences.
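As a rough illustration of how orthologs and paralogs can be read off a gene tree, here is a minimal Python sketch. It does not implement full tree reconciliation; it uses the simpler species-overlap rule instead (an internal node whose two child clades share a species is called a duplication, otherwise a speciation). The nested-tuple tree, the `species|gene` leaf naming and the function names are assumptions made for this example only.

```python
from itertools import product


def leaf_names(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaf_names(child))
    return names


def species_of(leaf):
    """Leaves are assumed to be named 'species|gene'."""
    return leaf.split("|", 1)[0]


def infer_pairs(tree):
    """Classify every leaf pair by the event at its last common node.

    Species-overlap rule: if the two child clades of a node share at
    least one species, the node is treated as a duplication (pairs
    across it are paralogs); otherwise it is treated as a speciation
    (pairs across it are orthologs). The tree must be strictly binary.
    """
    orthologs, paralogs = set(), set()

    def visit(node):
        if isinstance(node, str):
            return
        left, right = node
        left_leaves, right_leaves = leaf_names(left), leaf_names(right)
        shared = ({species_of(x) for x in left_leaves}
                  & {species_of(x) for x in right_leaves})
        target = paralogs if shared else orthologs
        for a, b in product(left_leaves, right_leaves):
            target.add(frozenset((a, b)))
        visit(left)
        visit(right)

    visit(tree)
    return orthologs, paralogs


# Toy tree: a duplication in the human/mouse ancestor, fly as outgroup.
GENE_TREE = ((("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")), "fly|A")
orthologs, paralogs = infer_pairs(GENE_TREE)
```

In this toy tree the fly gene comes out as a co-ortholog of all four human and mouse genes (a one-to-many relationship), which is the kind of distinction a simple pairwise best-hit search cannot give you directly.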

However, reconstructing gene phylogenies using the most modern and accurate methods is computationally very intensive (and the results are not free of artifacts). As a consequence, this approach cannot be used at large scale unless you have enough computational power. Generally speaking, if your species of interest are available as precomputed predictions in a phylogeny-based database, it is worth trying those first. Otherwise, you can move to alternative methods based on pairwise sequence comparisons. These methods are faster and can usually cope with larger amounts of data.

There is also a third, independent alternative that consists of inferring the evolution of genes (and therefore their relationships) from genomic features other than their coding sequence. For instance, the YGOB database can be used to obtain orthology and paralogy predictions based on the gene order conservation among several species. This approach is usually considered very reliable, and it is sometimes used as a gold standard for benchmarks.

Phylogeny-based analysis will be a better choice if (among other reasons):

• you are trying to predict orthology for a very intricate gene family, including many duplications, gene losses, etc.

• you need a fine-grained distinction among many-to-many, one-to-many and one-to-one relationships.

• you need orthology and paralogy predictions among many species at the same time.

• you want to know about gene losses.

-Note that phylogenetic trees are not perfect. They are not free of artifacts and they can lead (as other methods) to wrong predictions in the case of lineage sorting or horizontal gene transfer.-

Blast-based methods are much faster and provide good results. There are many tools that you can use to generate your own predictions. You will need to decide among them by considering their limitations and specific scope. For instance,

• Do you need a very fast approach to find pairs of orthologs in many species? (Best Reciprocal Hits)

• Is it crucial to differentiate one-to-one orthologs from sequences with in-paralogs? (InParanoid, COG, etc.)

• Do you need cross relationships among more than two species? (MultiParanoid, orthoMCL)

-Note that many of these tools also provide precomputed data.-

## An incomplete summary of resources:

(with special focus on phylogenetic based predictions)

### Phylogeny based methods

• MetaPhOrs (precomputed data): It combines predictions from many different databases and provides a consistency score for each orthology relationship. Useful to find highly reliable predictions. Data can be browsed interactively or downloaded from an FTP server.

• EnsemblCompara (precomputed data): Phylogeny based orthology and paralogy predictions. Ensembl bases its predictions on the analysis of gene family trees reconstructed using TreeBest (PhyML with fixed evolutionary model, DNA and protein analysis, slightly guided trees for better tree reconciliation).

• PhylomeDB (precomputed data): It bases its predictions on a per-gene phylogenetic analysis (PhyML testing several evolutionary models, plus alignment trimming and optimization). Note that, while Ensembl is a general purpose database, PhylomeDB is organized in "phylomes", which are genome-wide collections of trees whose taxon sampling and analysis design is usually hypothesis driven. According to the MetaPhOrs publication, PhylomeDB uses MetaPhOrs to measure the reliability of its phylome-based predictions.

In general terms, both Ensembl and PhylomeDB tend to benchmark very similarly (with good results), and they provide convenient API access to the DB as well as FTP downloads.

• TreeFam (precomputed data): Similar to EnsemblCompara but it includes a set of manually curated trees. It seems to be discontinued; the latest release dates from Feb 2009.

• PHOG, analysis of precomputed phylogenies using a slightly different method.

### Blast-based approaches

• Inparanoid (precomputed data and standalone application): Predictions between pairs of species. It accounts for one-to-many and many-to-many relationships.

• EggNOG (~COG) (precomputed data): Comprehensive catalog (630 species, including bacteria and archaea) of functionally annotated orthologous groups. An all-against-all BLAST comparison is used to build the orthologous groups. It accounts for in-paralogs.

• OrthoMCL, MultiParanoid: Extensions of the previous methods. They add the possibility of generating predictions for several species at the same time.

• Best Reciprocal Hits (BRH): The simplest method. Still very useful when only the best orthologous pairs between two species are required.
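To make the BRH idea concrete, here is a minimal Python sketch. It assumes you have already run BLAST in both directions (proteome A against B, and B against A) and saved the standard 12-column tabular output, where column 1 is the query, column 2 the subject and column 12 the bit score. The function names and the choice of bit score as the ranking criterion are assumptions of this example, not part of any particular tool.

```python
def best_hits(blast_lines):
    """Map each query to its highest-scoring subject.

    Expects tab-separated BLAST tabular lines (12 standard columns);
    ties are resolved in favour of the first hit seen.
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs that are each other's best hit."""
    a_to_b = best_hits(a_vs_b_lines)
    b_to_a = best_hits(b_vs_a_lines)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Tiny made-up example: geneA2's best hit is geneB1, but geneB1 prefers geneA1.
A_VS_B = [
    "geneA1\tgeneB1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneA1\tgeneB2\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-20\t90",
    "geneA2\tgeneB1\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-30\t120",
]
B_VS_A = [
    "geneB1\tgeneA1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "geneB1\tgeneA2\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-30\t120",
    "geneB2\tgeneA1\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-20\t90",
]
pairs = reciprocal_best_hits(A_VS_B, B_VS_A)
```

With real data you would pass open file handles instead of the lists. Note that this returns at most one partner per gene, so in-paralogs are silently dropped, which is exactly the limitation that InParanoid and similar tools address.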

## Some benchmarks (among others)

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018755

http://genomebiology.com/2007/8/6/R109 (figure 4)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838432/

Human, other primates and mouse are well represented in MetaPhOrs (combining Ensembl and other DB predictions). Following the above reasoning, I would give it a try.

ftp://phylomedb.org/metaphors/release-201009/orthologs/HUMAN_orthologs.txt.tar.gz

Hope it helps!!

1

+1. Very comprehensive and pretty precise answer.

0

One comment (I just came across this Q&A after a similar question was asked recently) is that phylogenetic reconstruction works best for identifying deep paralogy, where paralogs are clearly separated into unique sub-trees. But there are also issues with in-paralogs and co-orthologs, where things get a lot trickier.

Doing a de novo method like OrthoMCL can really help here when you have lots of lineage specific duplications, meaning orthologs and paralogs are interleaved throughout the tree.

14
12.0 years ago
lh3 33k

yhc gave a wonderful answer. I wanted to comment on his/hers, but hit the word limit.

1) Definition of orthologs. Fitch's definition is the most widely accepted. IMO, it is also more precise and evolutionarily meaningful than the several alternatives. If you want to find orthologs, go for databases using such a definition (e.g. Ensembl, TreeFam and InParanoid).

2) In general, I prefer tree-based methods, especially for mammals and perhaps also vertebrates. With a tree you can visually tell if the inference makes sense, which is a huge advantage. Another advantage of tree-based methods over pairwise methods is that tree-based methods produce consistent results across species. For example, say A is a 1:1 ortholog of B and B is a 1:1 ortholog of C. In principle, A is a 1:1 ortholog to C (not true if not 1:1), but a pairwise method cannot always guarantee this.

3) However, tree-based methods are not necessarily better than other methods. Reconstructing trees is very difficult. It is quite possible to come up with a purely heuristic method to achieve better results.

4) For tree-based methods, it is important to build gene trees with the species tree in mind, or to try to fix the tree topology using the species tree as a sort of prior. Blindly building a gene tree (even with the best algorithm) and then doing the standard reconciliation will give very bad inferences.

5) Tree-based methods do not work well for bacteria due to the lack of a good species tree and to LGT/HGT. LGT very rarely, if ever, happens in mammals.

6) For mammals, nucleotide trees tend to reflect the true evolution better than protein trees. One paper argues that a protein-guided nucleotide alignment is the best basis for building trees. This is also my experience. Ensembl/TreeFam use that approach.

7) For primates and rodents, EnsemblCompara is probably the best choice. It may not be the most accurate, but should be good enough for most purposes. I usually do not like to take results obtained by combining predictions. Combining is good for method comparison, but it leads to various artifacts that are hard to understand.
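The consistency issue in point 2 can be checked directly on pairwise predictions. Below is a minimal Python sketch, assuming you have three sets of 1:1 ortholog calls (A to B, B to C, A to C) stored as plain dictionaries; the function name and the gene names are made up for illustration.

```python
def inconsistent_triples(a_to_b, b_to_c, a_to_c):
    """Find chains A -> B -> C that the direct A -> C calls do not confirm.

    Each argument maps a gene to its predicted 1:1 ortholog in the
    other species. Returns tuples (a, b, c_via_b, c_direct), where
    c_direct is None when no direct A -> C prediction exists.
    """
    problems = []
    for a, b in a_to_b.items():
        c_via_b = b_to_c.get(b)
        if c_via_b is None:
            continue  # the chain stops at B, nothing to compare
        c_direct = a_to_c.get(a)
        if c_direct != c_via_b:
            problems.append((a, b, c_via_b, c_direct))
    return problems


# Hypothetical pairwise calls between human, mouse and rat genes.
human_mouse = {"hsaG1": "mmuG1", "hsaG2": "mmuG2"}
mouse_rat = {"mmuG1": "rnoG1", "mmuG2": "rnoG2"}
human_rat = {"hsaG1": "rnoG1", "hsaG2": "rnoG9"}
conflicts = inconsistent_triples(human_mouse, mouse_rat, human_rat)
```

Predictions taken from a single gene tree cannot produce such conflicts for 1:1 orthologs, whereas independently computed pairwise sets can.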

1

For 1:1 orthologs, transitivity always holds.

0

My earlier work used Ensembl data, so I will try EnsemblCompara first. Thank you! I learned a lot from your answer.

BTW, it is jhc, not yhc, whose answer you commented on.

0

'For example, say A is a 1:1 ortholog of B and B is a 1:1 ortholog of C. In principle, A is a 1:1 ortholog to C (not true if not 1:1)'. I would be very careful with that, as orthology is not a transitive relation! It is true, however, that for one-to-one orthologs it often holds.

10
12.0 years ago
Michi ▴ 990

I doubt that ANYBODY knows which is the best. But if you want to do it systematically, you may try the web API of EnsemblCompara, which you mentioned.

Take a look at the tutorial and read the chapter "Homologies and Protein clusters".
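For programmatic access today, one option is the Ensembl REST service rather than the Perl API that the tutorial covers. The Python sketch below is only an assumption-laden illustration: it assumes the `homology/symbol/:species/:symbol` endpoint at rest.ensembl.org with its `type`, `format` and `target_species` parameters, and it assumes the condensed JSON response carries `data`, `homologies`, `species`, `id` and `type` fields. Please check the current REST documentation before relying on any of this; the helper names are my own.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

SERVER = "https://rest.ensembl.org"  # assumed public REST server


def homology_url(symbol, species="human", target_species=()):
    """Build a query URL for orthologues of one gene symbol."""
    params = [("type", "orthologues"), ("format", "condensed")]
    params += [("target_species", target) for target in target_species]
    path = f"/homology/symbol/{quote(species)}/{quote(symbol)}"
    return f"{SERVER}{path}?{urlencode(params)}"


def parse_orthologs(payload):
    """Flatten the (assumed) condensed response into simple tuples."""
    rows = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            rows.append((entry.get("id"), hom.get("species"),
                         hom.get("id"), hom.get("type")))
    return rows


def fetch_orthologs(symbol, species="human", target_species=()):
    """Query the server (needs network access) and parse the answer."""
    request = Request(homology_url(symbol, species, target_species),
                      headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return parse_orthologs(json.load(response))
```

A call such as `fetch_orthologs("BRCA2", target_species=["mouse", "chimpanzee"])` would then be expected to return one tuple per predicted ortholog. For all human genes at once, bulk downloads (BioMart or the FTP dumps) are a better fit than one request per gene.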

0

Thank you! I will try this!

7
12.0 years ago

Another option is PhyloFacts orthology groups (PHOG). This resource provides a partial answer to your question, since the authors performed accuracy assessment analyses using a sample of 100 TreeFam proteins as a reference in their 2009 NAR paper:

Results on a benchmark dataset from the TreeFam-A manually curated orthology database show that PHOG provides a combination of high recall and precision competitive with both InParanoid and OrthoMCL, and allows users to target different taxonomic distances and precision levels through the use of tree-distance thresholds. For instance, OrthoMCL-DB achieved 76% recall and 66% precision on this dataset; at a slightly higher precision (68%) PHOG achieves 10% higher recall (86%). InParanoid achieved 87% recall at 24% precision on this dataset, while a PHOG variant designed for high recall achieves 88% recall at 61% precision, increasing precision by 37% over InParanoid.

The authors provide precision-recall curves for different methods, e.g...

...and also provide a more detailed description of their assessment here.
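For readers who want to run the same kind of assessment on their own predictions, precision and recall over ortholog pairs can be computed as in the short Python sketch below. The reference set would be something like a curated benchmark; the gene names here are invented and the function name is my own. Note also that the "37%" gain quoted above is a difference of 37 percentage points (61% versus 24%).

```python
def precision_recall(predicted_pairs, reference_pairs):
    """Compare predicted ortholog pairs against a reference set.

    Pairs are unordered, so ("a", "x") and ("x", "a") count as equal.
    precision = true positives / all predicted pairs
    recall    = true positives / all reference pairs
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    reference = {frozenset(pair) for pair in reference_pairs}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up example: 4 predictions, 3 reference pairs, 2 in common.
predicted = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
reference = [("x", "a"), ("b", "y"), ("e", "v")]
precision, recall = precision_recall(predicted, reference)
```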

EDIT:

See also Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes Chen F, Mackey AJ, Vermunt JK, Roos DS. PLoS One. 2007 Apr 18;2(4):e383.

0

Thank you! Great paper!

0

Wow, I had never seen this paper and graph. Good to know that somebody has done such a comparison.

4
12.0 years ago
Treylathe ▴ 950

Agreed with the above, not sure it would be easy to assess which is best, but the ones you named are good:

You might want to try out MetaPhOrs.

1

Thank you ! Great blog by the way!

2
12.0 years ago
Philippe ★ 1.9k

Hello,

I also think there is no single best method. You should take into account what you really need for your project. Depending on the analyses you might want to perform later, and on the biological question behind them, some methods may be more suitable than others. You also said you wanted to retrieve human orthologs. In which species? Mammalian species only, or more distant ones? Depending on the answers to these questions, you could focus on a shorter list of databases that use specific methods.

If you are really hesitating between different resources, one good solution might be to build several gene sets based on different databases. You can then assess the differences between them (which might not be that large) and get a more precise view of your data. Many databases make it easy to download or access preprocessed data, so building different sets of orthologous groups shouldn't take you too much time.
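Once you have downloaded predictions from a few resources, comparing them can be as simple as the following Python sketch. It assumes each resource has been reduced to a collection of (human gene, other-species gene) pairs using the same identifiers; the source names and gene IDs below are placeholders, not real data.

```python
from itertools import combinations


def compare_sources(sources):
    """Pairwise overlap statistics between ortholog pair collections.

    `sources` maps a source name to an iterable of gene pairs.
    Returns {(name1, name2): {"shared", "only_first", "only_second",
    "jaccard"}} for every pair of sources, names in sorted order.
    """
    normalized = {name: {frozenset(pair) for pair in pairs}
                  for name, pairs in sources.items()}
    report = {}
    for first, second in combinations(sorted(normalized), 2):
        set1, set2 = normalized[first], normalized[second]
        union = set1 | set2
        report[(first, second)] = {
            "shared": len(set1 & set2),
            "only_first": len(set1 - set2),
            "only_second": len(set2 - set1),
            "jaccard": len(set1 & set2) / len(union) if union else 1.0,
        }
    return report


# Placeholder data standing in for two downloaded prediction sets.
example = {
    "source_A": [("h1", "m1"), ("h2", "m2"), ("h3", "m3")],
    "source_B": [("m1", "h1"), ("h2", "m2"), ("h4", "m4")],
}
summary = compare_sources(example)
```

Pairs that appear in only one source are usually the interesting ones to inspect by hand, for example in a gene tree viewer.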

0

Thank you! This helps a lot. I had also thought about comparing different methods. You helped me make up my mind.

2
12.0 years ago

One more for your list: OMA browser

The question 'which data-source is best' is rather a moot point. More important is what evidence can be gathered to determine the likelihood of a specific ortholog prediction. e.g. is there synteny evidence? Does the gene match the species tree? Is the data only a bi-directional best hit?

The easiest predictions are when there has been no gene duplication event since the last common ancestor (LCA) of the genes you are examining. GeneTreeView in Ensembl Compara should help to determine this. The hardest cases are highly paralogous families that are known to undergo recombination / concerted evolution, e.g. GPCRs / protein kinases. For the latter, look for specialised databases which focus on them, particularly on their active sites (otherwise you can end up focusing on a protein domain from the LCA that has a different evolutionary history than the active site).

0

Thank you for your analysis. I learned a lot and will rethink it.

1
11.6 years ago

The assumption is orthologous genes have identical or highly related functions and this sharing is greater than for paralogs. But Nehrt, Hahn et al challenge this by offering that "the most important factor in the evolution of function is not amino acid sequence, but rather the cellular context in which proteins act."

They combined experimentally derived function with gene expression data on nearly 9000 proteins.

This is certainly a controversial statement, but a thought-provoking one. After all, it is the integration of diverse data that is driving a lot of genomics. One example is GWAS (genome-wide association studies) + gene expression = better identification of the likely causal variant. The same might be applied to the ortholog/paralog definition.

1
11.6 years ago

Here is an online tool you can try:

http://oxytricha.princeton.edu/BlastO/

1
8.5 years ago
Prakki Rama ★ 2.6k

Adding one more to the list: Proteinortho, a BLAST-based approach for detecting orthology. The authors conclude that the tool requires much less time and memory than earlier tools.

0

Hi Prakki Rama, I would like to know how to get 1:1 orthologs using Proteinortho, because many of the orthologs I got had multiple transcripts, like below.

# Species   Genes   Alg.-Conn.  trinity_15BJ05.cd-hit.fasta.transdecoder.pep    trinity_60.cd-hit.fasta.transdecoder.pep
1   3   0.14    Hufu_TR37005|c0_g1_i1|m.38330 Hegu_TRINITY_DN50153_c0_g4_i5|m.67659,Hegu_TRINITY_DN50153_c0_g4_i13|m.67680,Hegu_TRINITY_DN50153_c0_g4_i10|m.67675

1   2   0.248   Hufu_TR100646|c0_g1_i1|m.124281 Hegu_TRINITY_DN37308_c0_g3_i1|m.25758,Hegu_TRINITY_DN37308_c0_g1_i1|m.25757


I have no idea how to get the 1:1 orthologs. Do you have any scripts or suggestions for doing that? I would really appreciate your help. Thank you.

0

Yes. I faced the same situation some time ago. Then I came across the reciprocal smallest distance tool, which assigns the best possible ortholog to each query sequence. Even so, a target can still be assigned to multiple query proteins.

0

Thank you very much. So would it be better to first pick a single transcript to represent each gene?

0

Orthology/Paralogy is defined at the gene level, not the transcript level. That said you can still have valid 1:many or many:1 relationships. Just be careful about whether you are referring to many genes versus many transcripts.
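If, after collapsing transcripts to genes, you only want the strict 1:1 rows from a Proteinortho-style table like the one quoted earlier in this thread, a filter along these lines may help. This is a sketch under assumptions: the table is tab-separated, the first three columns are the species count, gene count and connectivity, each remaining column belongs to one input proteome, multiple members are comma-separated, and a missing species is marked with `*`. The function name and the toy IDs are invented.

```python
def one_to_one_rows(lines):
    """Keep rows in which every species column holds exactly one ID.

    Header lines (starting with '#') and blank lines are skipped.
    Returns a list of tuples, one ID per species column.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cells = line.split("\t")[3:]  # drop the three summary columns
        if cells and all(cell != "*" and "," not in cell for cell in cells):
            kept.append(tuple(cells))
    return kept


# Toy table with the assumed layout: one 1:1 row, one 1:many, one absent.
TABLE = [
    "# Species\tGenes\tAlg.-Conn.\tspeciesA.pep\tspeciesB.pep",
    "2\t2\t1\tA_gene1\tB_gene1",
    "2\t3\t0.25\tA_gene2\tB_gene2,B_gene3",
    "1\t2\t1\tA_gene4,A_gene5\t*",
]
strict = one_to_one_rows(TABLE)
```

Keep in mind the point above: a row with several IDs for one species may reflect several isoforms of one gene rather than several genes, so reduce each gene to one representative sequence before running the ortholog search.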

0

Hi Dan Gaston, thanks a lot. I have finished the de novo assembly, used CD-HIT-EST to remove redundant transcripts, then used TransDecoder to predict ORFs before predicting orthologs. As you said, there were still cases where one gene included multiple transcripts. Maybe that's another reason why there were several 1:many relationships.

After predicting ORFs with TransDecoder, many genes still had more than one transcript, like below.

 >Zhze_TR100572|c0_g1_i1|m.127365 TR100572|c0_g1_i1|g.127365  ORF TR100572|c0_g1_i1|g.127365 TR100572|c0_g1_i1|m.127365 type:5prime_partial len:354 (-) TR100572|c0_g1_i1:322-1383(-)
TMTSAILRRNSSKQGLQNLIRLTAQWSVEDEEEAARERRRREREKQLRSQAEEGLNGTVS
CSESAALAQENHYDFKPSGTSELEEDEGFSDWSQKLEQRKQRSPRQSYEEENSGVREAEV
KLEQIQLDQECLEEKMVGREEGRLCQEEEEAQEQEEGEQAEQEEKKRRRNDGGKEEETPE
KRQKAPSLASLEEEELCSDHTAVCSTKITDRTESLNRSIQKSNSIKRSQPPLPVSKIDDR
LEQYTQAIETSTKAPKPVRQPSLDLPTTSMMVASTKSLWETGEVTAQSAVKPLACKDIVA
GDIVSKRSLWEQKGNPKPESSIKSIHPSGKKYKFVATGHGQYKKVLIDDAAEQ*
>Zhze_TR100572|c1_g1_i1|m.127368 TR100572|c1_g1_i1|g.127368  ORF TR100572|c1_g1_i1|g.127368 TR100572|c1_g1_i1|m.127368 type:complete len:184 (+) TR100572|c1_g1_i1:287-838(+)
MSDEEKKRRAATARRQHLKSAMLQLAATEIEKEAAAKEVEKQNYLAEHCPPLSLPGSMQE
AMLRALLGSKHKVCMDLRANLKQVKKEDTEKEKDLRDVGDWRKNIEEKSGMEGRKKMFEA
GES*


I would like to pick the best or longest transcript to represent each gene for the subsequent ortholog prediction, but I have no idea how. Do you have any suggestions or scripts for this problem? I would really appreciate your help. Thank you.

0

Yes, you will often have the case of a single gene having multiple transcripts. And you're right, selecting one to be representative in the ortholog search is the way to go if you want to use the protein sequence for the search. Selecting the longest is a pretty straightforward task. If you have a FASTA file you could write a small script to organise transcripts by gene and then select the longest sequence as representative, and store all representative transcripts in a separate file.
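A minimal Python sketch of such a script is shown below. It assumes Trinity-style identifiers as in the FASTA excerpt above, where the gene is everything up to `_gN` and the isoform is the following `_iN` (for example `Zhze_TR100572|c0_g1_i1|m.127365` belongs to gene `Zhze_TR100572|c0_g1`). The regular expression and the function names are my own; adapt them if your IDs look different.

```python
import re

# Gene ID = everything up to and including "_g<number>", before "_i<number>".
GENE_RE = re.compile(r"^(.*?_g\d+)_i\d+")


def read_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Derive the gene ID from the first word of a FASTA header."""
    seq_id = header.split()[0]
    match = GENE_RE.match(seq_id)
    return match.group(1) if match else seq_id


def longest_per_gene(records):
    """Keep the longest peptide per gene (a trailing '*' is not counted)."""
    best = {}
    for header, seq in records:
        gene = gene_id(header)
        length = len(seq.rstrip("*"))
        if gene not in best or length > len(best[gene][1].rstrip("*")):
            best[gene] = (header, seq)
    return best


def write_fasta(best, handle, width=60):
    """Write the representative sequences, wrapped at `width` columns."""
    for header, seq in best.values():
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

Typical use would be `with open("in.pep") as fin, open("longest.pep", "w") as fout: write_fasta(longest_per_gene(read_fasta(fin)), fout)`, with file names of your choice. Longest is only one possible criterion; you could just as well keep the isoform with a complete ORF or the highest expression.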