Question: How to retrieve the paralogues for a limited taxonomic group
Asked 4.8 years ago by Joseph Hughes (Scotland, UK):

I am trying to retrieve, in an automated fashion and for a large number of genes, the orthologues as well as the paralogues of a given Ensembl gene ID. Take for example ENSG00000258588: I am retrieving all the orthologues and paralogues using

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $geneid = "ENSG00000258588"; # ENSP00000346916

# Load the registry automatically (public Ensembl server, anonymous access)
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous',
);

## Get the compara gene member adaptor
my $gene_member_adaptor = $registry->get_adaptor("Multi", "compara", "GeneMember");

## Get the compara member
my $gene_member = $gene_member_adaptor->fetch_by_stable_id($geneid);
my @orthologIDs;
if (defined $gene_member) {
  my $homology_adaptor = $registry->get_adaptor('Multi', 'compara', 'Homology');
  my $homologies = $homology_adaptor->fetch_all_by_Member($gene_member);
  my $member_adaptor = $registry->get_adaptor('Multi', 'compara', 'Member');
  foreach my $homology (@{$homologies}) {
    # Each homology pairs the query gene with one orthologue or paralogue
    my @members = @{$homology->get_all_Members()};
    foreach my $this_member (@members) {
      my $orthologID = $this_member->stable_id;
      # Assumed continuation: collect each member's stable ID
      push @orthologIDs, $orthologID;
    }
  }
}

However, the problem is that I end up with far more paralogues than I want. Looking at the gene tree, I really only want ENSG00000258588 (TRIM6-TRIM34) and ENSG00000258659 (TRIM34). Essentially I want the close paralogues but not the distant paralogues such as TRIM22. In this particular case, I could limit the paralogues to the ancestral taxonomy Homininae, but I do not always know what the ancestral taxonomy will be: sometimes it might be Rodentia or Primates, depending on when the paralogue arose. I am really only interested in post-mammalian-divergence events.

An alternative approach I have thought of, but do not know how to implement, is to look for the most recent common ancestor of all the orthologues and then retrieve all the Ensembl IDs from that node.

Any pointers, advice, suggestions would be most gratefully appreciated.

Modified 4.8 years ago by Matthieu - Ensembl Compara • written 4.8 years ago by Joseph Hughes
Answer from Matthieu - Ensembl Compara (Cambridge, UK), 4.8 years ago:

Hi Joseph,

The Homology object ($homology) has a taxonomy_level() method that returns the name of the LCA of the pair of genes.

There is also a species_tree_node() method, which maps back to a node in the species tree. Each node has a taxon() method that links to the NCBI taxonomy and a name() method, and you can also compare nodes directly with has_ancestor().

Matthieu, Ensembl Compara

Written 4.8 years ago by Matthieu - Ensembl Compara
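
As a rough illustration of the filtering this answer points towards, the sketch below works on plain hashes rather than live Homology objects, so it runs without a database connection. Each record stands for the values one would read from $homology->description and $homology->taxonomy_level(); the record layout, the helper name recent_paralogues, the gene names and the sample levels are all assumptions made for illustration, not Ensembl output.

```perl
use strict;
use warnings;

# Keep only paralogue records whose duplication level is in the allowed list.
# In real code the fields would be filled from $homology->description and
# $homology->taxonomy_level() (assumed mapping, not shown here).
sub recent_paralogues {
    my ($records, $allowed_levels) = @_;
    my %allowed = map { (lc($_) => 1) } @{$allowed_levels};
    my @kept;
    for my $record (@{$records}) {
        next unless $record->{description} =~ /paralog/;       # skip orthologues
        next unless $allowed{ lc $record->{taxonomy_level} };   # skip older duplications
        push @kept, $record;
    }
    return \@kept;
}

# Toy records (hypothetical values, not real Ensembl output).
my @records = (
    { gene => 'geneA', description => 'within_species_paralog', taxonomy_level => 'Homininae' },
    { gene => 'geneB', description => 'within_species_paralog', taxonomy_level => 'Bilateria' },
    { gene => 'geneC', description => 'ortholog_one2one',       taxonomy_level => 'Primates'  },
);

my $kept = recent_paralogues(\@records, [qw(Mammalia Theria Eutheria Primates Homininae)]);
print join(',', map { $_->{gene} } @{$kept}), "\n";
```

The allowed list still has to be written out by hand here, which is the limitation raised in the follow-up comment.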

With taxonomy_level() I am going to have to provide a long list of taxonomy names to exclude or include to make sure I have the paralogues I want to include. Where can I find out more about the species_tree_node() method? Thanks

Reply written 4.8 years ago by Joseph Hughes

If you are interested only in the human lineage, there are not that many taxonomy levels: mammalia, theria, eutheria, boreoeutheria, euarchontoglires, primates, etc. You can get the names easily from:

Or you can use the newick species tree: to automate the process

Reply written 4.8 years ago by abascalfederico
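
One way the Newick idea could be automated is sketched below: a small parser that records each node's parent and then walks up from a chosen leaf to list its ancestral taxa. It assumes a Newick string with labelled internal nodes and no branch lengths; the function name ancestors_from_newick and the three-species toy tree are made up for illustration and are not the Ensembl species tree.

```perl
use strict;
use warnings;

# Return the labels of all internal nodes above $leaf, nearest ancestor first.
# Assumes well-formed Newick with labelled internal nodes and no branch lengths.
sub ancestors_from_newick {
    my ($newick, $leaf) = @_;
    my (%parent, %label, @open, $closed);
    my $next_id = 0;
    for my $tok ($newick =~ /([(),;]|[^(),;\s]+)/g) {
        if ($tok eq '(') {
            push @open, $next_id++;            # start a new internal node
            $closed = undef;
        } elsif ($tok eq ')') {
            $closed = pop @open;               # this internal node is complete
            $parent{$closed} = $open[-1] if @open;
        } elsif ($tok eq ',') {
            $closed = undef;
        } elsif ($tok eq ';') {
            last;
        } elsif (defined $closed) {
            $label{$closed} = $tok;            # a label right after ')' names that node
            $closed = undef;
        } else {
            my $id = $next_id++;               # a leaf under the innermost open node
            $label{$id} = $tok;
            $parent{$id} = $open[-1] if @open;
        }
    }
    my ($node) = grep { $label{$_} eq $leaf } keys %label;
    return [] unless defined $node;
    my @names;
    while (defined($node = $parent{$node})) {
        push @names, $label{$node} if defined $label{$node};
    }
    return \@names;
}

# Toy tree (hypothetical, far smaller than the real species tree).
my $tree    = '((Homo_sapiens,Pan_troglodytes)Homininae,Mus_musculus)Euarchontoglires;';
my $lineage = ancestors_from_newick($tree, 'Homo_sapiens');
print join(' ', @{$lineage}), "\n";
```

The resulting list could then serve as the set of taxonomy levels to accept, instead of typing the names by hand.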

I'm not sure what rule you'd like to use. The full doc is here: There are a lot of methods in this module, which can do many things with tree / graph structures.

One approach is to call $species_tree_node->taxon()->classification, which returns a string like "(...) mammalia theria eutheria boreoeutheria euarchontoglires primates (...) homo". For instance, if you want ancestors below the Theria node, you can test: $classification =~ / theria /

You can also call $species_tree_node->get_all_ancestors(), which returns all the nodes above the current one, and then check the name, taxon_id, etc. of each of them.
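
The classification check described above can be sketched without a database connection. The helper below splits the string into words and looks for an exact, case-insensitive match, which avoids two pitfalls of a bare pattern: "theria" matching inside "eutheria", and a space-delimited pattern failing when the taxon is the first or last word. The function name classification_includes and the abbreviated lineage string are assumptions for illustration.

```perl
use strict;
use warnings;

# True (1) if $taxon appears as a whole word in a space-separated
# classification string, ignoring case; false (0) otherwise.
sub classification_includes {
    my ($classification, $taxon) = @_;
    my %seen = map { (lc($_) => 1) } split ' ', $classification;
    return $seen{ lc $taxon } ? 1 : 0;
}

# Toy string modelled on the example in the text (hypothetical, abbreviated).
my $classification = 'mammalia theria eutheria boreoeutheria euarchontoglires primates homo';

for my $taxon (qw(theria Mammalia rodentia)) {
    my $answer = classification_includes($classification, $taxon) ? 'yes' : 'no';
    print "$taxon: $answer\n";
}
```

With a real species-tree node, the first argument would be the string returned by $species_tree_node->taxon()->classification.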

The best way really depends on the filtering you want to apply. You mentioned you only want the most recent paralogue. Is that correct?

Genomicus used to use a different species tree because we/they wanted to add the "Boreoeutheria" node, but since NCBI has added it, I think the species trees are now identical.

Matthieu, Ensembl Compara and ex-Genomicus :)

Reply written 4.8 years ago by Matthieu - Ensembl Compara

Thanks, this is all very useful. I will investigate further with these leads.

Reply written 4.7 years ago by Joseph Hughes
Answer from abascalfederico, 4.8 years ago:

You can get all the paralogues and then select only those sharing a last common ancestor with the query gene at certain levels: mammalia, theria, eutheria, etc.

I don't know how the "last common ancestor" information is stored within the homology object in the API, but with BioMart it is very easy to get the full list of human genes, their paralogues and the LCAs.

Written 4.8 years ago by abascalfederico