Question

A Quick Way To Match A List With Ensembl Gene Ids To The Specific Ensembl Db Version?

11.5 years ago

John Van Dam ▴ 110

Hi all,

I received a list of Ensembl gene IDs (about 1000) from a collaborator which I need to process. In order to do so I need to know which Ensembl version they used (it is not the latest version, 68). I asked them, but I suspect they may not know either (always, always mention the database version in your research papers!). I need to stay in the same Ensembl version for our collective sanity.

What I would like is to find the latest (but I'll settle for any) Ensembl version that contains 100% of my identifiers. I could not find any scripts or tools that can do this automatically, nor did I find anything in the Ensembl Perl API documentation that could help me. Does anyone know of a method or tool to match my gene list to the appropriate Ensembl version automatically?

Many thanks!

John

ensembl identifiers database • 3.7k views

updated 11.5 years ago by Andy Yates ▴ 120 • written 11.5 years ago by John Van Dam ▴ 110

That's a tricky situation. You should check with your collaborator whether they know the NCBI / hg assembly number; from that you can narrow down the corresponding Ensembl releases.

11.5 years ago by Khader Shameer 18k

score 3 · Answer 1 · 2012-10-20

I knocked together this quick bash script which might help:

It takes a species name and a text file of IDs (one per line) as parameters, and then counts how many of those IDs exist in each available Ensembl core database for that species.

The only tricky bit is the schema change around v66 (where stable_id was de-normalised into the gene table)

#!/bin/bash

if [[ $# -ne 2 ]]
then
  echo "Need 2 parameters: a species (e.g. homo_sapiens) and a file containing Ensembl IDs" >&2
  exit 1
fi

SPECIES=$1
FILENAME=$2

DATABASES=( $(mysql -h ensembldb.ensembl.org -s -P 5306 -u anonymous -e "show databases" | grep "${SPECIES}_core_") )

echo "Found ${#DATABASES[@]} core databases for ${SPECIES}"

echo "Reading from $FILENAME"
LINECOUNT=0
IDS="'"
# Read the file, and build an sql string from it (used later)
for LINE in $(cat "$FILENAME")
do
    if [ "$IDS" != "'" ]
    then
        IDS="$IDS','"
    fi
    IDS="$IDS$LINE"
    LINECOUNT=$((LINECOUNT + 1))
done
IDS="$IDS'"

echo "Read $LINECOUNT ids from $FILENAME"

for DB in "${DATABASES[@]}"
do
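    # Release number is the first field after "_core_" (e.g. 63 in homo_sapiens_core_63_37)
    MAJ=$(echo "$DB" | sed "s/^${SPECIES}_core_\([0-9]*\)_.*/\1/")
    # From v67 on, stable_id is a column of the gene table
    SQL="SELECT COUNT(*) FROM gene WHERE stable_id IN ($IDS)"
    if [ "$MAJ" -lt 67 ]
    then
      # Older schemas keep stable ids in a separate gene_stable_id table
      SQL="SELECT COUNT(gene.gene_id) FROM gene JOIN gene_stable_id USING( gene_id ) WHERE gene_stable_id.stable_id IN ($IDS)"
    fi
    ROWS=$(mysql -s -h ensembldb.ensembl.org -P 5306 -u anonymous "$DB" -e "$SQL")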
    echo "In Ensembl version $MAJ ($DB) found $ROWS out of $LINECOUNT"
done

I tried it with 2000 random Ensembl IDs taken from v63, and got:

$ ./scan.sh homo_sapiens ids.txt 
Found 22 core databases for homo_sapiens
Reading from ids.txt
Read 2000 ids from ids.txt
In Ensembl version 48 (homo_sapiens_core_48_36j) found 919 out of 2000
In Ensembl version 49 (homo_sapiens_core_49_36k) found 919 out of 2000
In Ensembl version 50 (homo_sapiens_core_50_36l) found 958 out of 2000
In Ensembl version 51 (homo_sapiens_core_51_36m) found 962 out of 2000
In Ensembl version 52 (homo_sapiens_core_52_36n) found 977 out of 2000
In Ensembl version 53 (homo_sapiens_core_53_36o) found 977 out of 2000
In Ensembl version 54 (homo_sapiens_core_54_36p) found 977 out of 2000
In Ensembl version 55 (homo_sapiens_core_55_37) found 1503 out of 2000
In Ensembl version 56 (homo_sapiens_core_56_37a) found 1663 out of 2000
In Ensembl version 57 (homo_sapiens_core_57_37b) found 1677 out of 2000
In Ensembl version 58 (homo_sapiens_core_58_37c) found 1828 out of 2000
In Ensembl version 59 (homo_sapiens_core_59_37d) found 1830 out of 2000
In Ensembl version 60 (homo_sapiens_core_60_37e) found 1877 out of 2000
In Ensembl version 61 (homo_sapiens_core_61_37f) found 1921 out of 2000
In Ensembl version 62 (homo_sapiens_core_62_37g) found 1960 out of 2000
In Ensembl version 63 (homo_sapiens_core_63_37) found 2000 out of 2000
In Ensembl version 64 (homo_sapiens_core_64_37) found 1981 out of 2000
In Ensembl version 65 (homo_sapiens_core_65_37) found 1979 out of 2000
In Ensembl version 66 (homo_sapiens_core_66_37) found 1967 out of 2000
In Ensembl version 67 (homo_sapiens_core_67_37) found 1961 out of 2000
In Ensembl version 68 (homo_sapiens_core_68_37) found 1903 out of 2000
In Ensembl version 69 (homo_sapiens_core_69_37) found 1902 out of 2000
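
To get from that report to what the question asks for (the latest release that contains every ID), the output lines can be filtered. This is only a minimal sketch, assuming the exact line format printed by the script above; the function name `best_release` is made up for illustration and is not part of the original script.

```bash
#!/bin/bash

# best_release: read the scan report on stdin and print the highest Ensembl
# release in which every id was found. Prints nothing if no release has a
# 100% match. (Hypothetical helper, not part of the original script.)
best_release() {
    awk '
        # Report lines look like:
        # In Ensembl version 63 (homo_sapiens_core_63_37) found 2000 out of 2000
        $1 == "In" && $2 == "Ensembl" && $3 == "version" {
            version = $4 + 0    # release number
            found   = $7 + 0    # ids present in this release
            total   = $10 + 0   # ids in the input file
            if (total > 0 && found == total && version > best) best = version
        }
        END { if (best > 0) print best }
    '
}
```

With the function defined in the shell, something like `./scan.sh homo_sapiens ids.txt | best_release` would print the release number, or nothing at all when no single release holds all the IDs.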

score 1 · Answer 2 · 2012-10-17

Hi John,

I don't think what you are looking for exists. It may be easier to try several versions and keep the one that gives you the most successful conversions. I doubt you will get 100%, since the conversion is not perfect.

Another thing you could try is the new REST API http://beta.rest.ensembl.org/

I'm not quite sure it works with previous versions...

Good luck!

score 1 · Answer 3 · 2012-10-23

Hi John,

Firstly, the REST API does not support previous versions; we are currently discussing how best to support archive versions. Once we come up with a way to do this we will let everyone know.

Secondly, you can do this lookup using the latest Ensembl DB, since we retain all stable ID history with every Ensembl core database release. Human, for example, tracks all the way back to 2002. If you are in Perl then use the ArchiveStableIdAdaptor to search for your IDs. There are a good few examples on the documentation pages about how to work with it. If you're not in Perl then you will want to look at the tables:

mapping_session
stable_id_event
gene_archive

The mapping_session table declares which release we mapped from and to, stable_id_event records what happened to the ID, and gene_archive is where our retired genes live. It's quick enough to look for your stable ID in the archive table; if it's not there then it is still active.

However, I must urge caution when using Ensembl stable IDs, especially if you are using human. Human goes through a merge with Havana manual annotation every release, meaning that a locus can change in structure quite significantly whilst still retaining the overall structure of the original gene's transcript splicing model (gene stable IDs are assigned/mapped based on their transcript identity). That does mean that a multi-transcript gene's identifier can be active and located in the correct location but be attached to the "wrong" locus. One way to avoid this that I've been suggesting to users is to switch from thinking about genes to using transcripts. They are easier to map between releases and have a version attached which is incremented if there is a sequence difference.
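
For the non-Perl route, the archive lookup could be sketched as a small shell helper that turns an ID file into a query joining gene_archive to mapping_session. This is a rough sketch, not an official recipe. The function name `archive_sql` is hypothetical. The column names (`gene_stable_id`, `mapping_session_id`, `old_release`, `new_release`) are assumed from the core schema and should be checked against the schema documentation for the release you connect to.

```bash
#!/bin/bash

# archive_sql: read stable ids (one per line) on stdin and print a query that
# lists, for every id present in gene_archive, the releases of the mapping
# session that archived it. Hypothetical helper; column names are assumptions.
archive_sql() {
    local ids
    # Drop any ".version" suffix and stray characters, skip blank lines,
    # then single-quote each id and join them with commas.
    ids=$(awk -v q="'" '
        { sub(/\..*/, ""); gsub(/[^A-Za-z0-9]/, "") }
        length($0) { printf "%s%s%s%s", sep, q, $0, q; sep = "," }
    ')
    cat <<EOF
SELECT DISTINCT ga.gene_stable_id, ms.old_release, ms.new_release
FROM gene_archive ga
JOIN mapping_session ms USING (mapping_session_id)
WHERE ga.gene_stable_id IN ($ids)
ORDER BY ga.gene_stable_id;
EOF
}
```

The query could then be sent to the public server used by the bash script in this thread, for example `archive_sql < ids.txt | mysql -h ensembldb.ensembl.org -P 5306 -u anonymous homo_sapiens_core_68_37`. IDs that return no rows are not in the archive and so should still be active in that release. Reading `old_release` as the last release in which the archived gene version existed is an assumption worth verifying on a few IDs whose history you already know.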

I hope this helps you out or at least makes you aware of some pitfalls.

score 0 · Answer 4 · 2012-10-22

11.5 years ago

John Van Dam ▴ 110

Thanks Biojl and Tim! I'll give both a try. If I stumble onto something that works I'll post it here. Thanks for the help


Updated my answer to scan all available DBs, and tested it with 2000 ids (seems to work)... Fingers crossed! ;-)

11.5 years ago by tim.yates ▴ 40