Question

Trouble Extracting Pan-Taxonomic Homologies Using Ensembl Compara Perl API

3


13.7 years ago

SimonCB765 ▴ 150

I've hit a problem with the Ensembl Compara Perl API and was wondering if anyone had any advice. I have a list of UniProt accessions for human proteins, which I've mapped to Ensembl gene IDs using Ensembl BioMart. I'm using those gene IDs with the Ensembl Perl API to extract the homologies for every gene. I have two almost identical scripts: one accesses the Ensembl database through the API, and the other accesses the Ensembl Genomes database. The script that accesses the Ensembl database runs perfectly with all of my gene IDs, but the one accessing the Ensembl Genomes database fails for some of them. For reference, these are a few of the IDs it fails with:

ENSG00000223532 ENSG00000206505 ENSG00000224608 ENSG00000231834 ENSG00000228299 ENSG00000206306 ENSG00000206450 ENSG00000235657 ENSG00000206240 ENSG00000223980 ENSG00000224320 ENSG00000228080 ENSG00000229215

The script accessing Ensembl Genomes attempts to retrieve the homologies from the pan-taxonomic database. I've tracked the problem down to the fact that the lines

my $memberAdaptor = Bio::EnsEMBL::Registry->get_adaptor('pan_homology', 'compara', 'Member');
...
# First get the Member object. As we are searching for homology information this is a gene.
my $member = $memberAdaptor->fetch_by_source_stable_id("ENSEMBLGENE",$ensemblGene);

leave $member undefined. My assumption is that this means the gene ID can't be found in the database. However, when I check the Ensembl Genomes website, it does seem to find the gene IDs. I was under the impression that the IDs were meant to be stable, so I'm a bit confused as to why they can't be found.

Simon

ensembl api bioperl • 3.6k views

updated 13.7 years ago by Andeyatz ▴ 70 • written 13.7 years ago by SimonCB765 ▴ 150

score 2 · Answer 1 · 2011-11-08

Hi Simon,

As the person who built the pan-taxonomic database until recently, I hope I can answer your question. What I think has happened here is an explainable difference between the two resources. Ensembl's Compara database contains every gene as a member because Ensembl runs two pipelines to assess similarity between proteins: the GeneTree pipeline and the family pipeline. The GeneTree pipeline operates only over what Ensembl considers reference (not to be confused with what the GRC considers reference), which boils down to no haplotypes or patches. The family pipeline runs over every protein, so the genes that the GeneTree pipeline skips are still imported as members.

Ensembl Genomes' pan-taxonomic resource is only a run of the GeneTree pipeline, so it does not import genes that lie on haplotype or patch regions.

The only advice I can give is not to assume that every stable ID will result in a Member object from the pan-taxonomic database.
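
As a rough sketch of that advice, something like the following could separate the gene IDs that do have a Member from those that don't, rather than letting the script fail on an undefined value. It relies only on the fetch_by_source_stable_id call from your question; the helper name split_by_membership and its return shape are made up for illustration, and newer releases of the API may name the adaptor and method differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): split gene IDs into
# those with a Member in the pan-taxonomic database and those without.
# $member_adaptor is assumed to be the 'Member' adaptor from the question,
# or any object that provides fetch_by_source_stable_id.
sub split_by_membership {
    my ($member_adaptor, @gene_ids) = @_;

    my %members;    # gene ID => Member object
    my @missing;    # gene IDs for which the lookup returned undef

    for my $gene_id (@gene_ids) {
        my $member = $member_adaptor->fetch_by_source_stable_id('ENSEMBLGENE', $gene_id);
        if (defined $member) {
            $members{$gene_id} = $member;
        }
        else {
            # Possibly a gene on a haplotype or patch region, which the
            # GeneTree pipeline does not import.
            push @missing, $gene_id;
        }
    }

    return (\%members, \@missing);
}

# Intended use with the adaptor from the question (left as a comment so the
# block stands alone without the Ensembl libraries installed):
#
#   my ($members, $missing) = split_by_membership($memberAdaptor, @ensemblGenes);
#   warn "No pan-taxonomic Member for: @$missing\n" if @$missing;
```

The homology lookup would then run only over the IDs in the first hash, and the second list can be logged or checked against the main Ensembl database instead.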

I hope this helps.