Hi,
To answer some of your questions:
Unfortunately there is not necessarily a one-to-one mapping between Entrez Gene and Ensembl Gene IDs, although this is improving. As you can read here: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
they are working on consolidating them for human and mouse.
It may even differ depending on the database you use to convert one ID to the other. So if you follow the links from Entrez Gene to Ensembl, this may give a different mapping than when you use the Ensembl BioMart for the conversion.
Personally, I would prefer Entrez Gene IDs, as they are more stable and it is easier to map outdated IDs to current IDs. This is much harder for Ensembl Gene IDs.
Both could be considered standards. Another option is the HGNC symbol, which is more commonly used as the name for a gene.
I have some e-mails from Ensembl, and a note from Entrez Gene, that explain how they map their IDs to each other.
---------------------------------------------------------------------------------------------------------------------------------->
I asked the following question to Ensembl:
Regarding the external references provided by Ensembl, I was wondering
how the references to Entrez Gene are retrieved. Also on basis of the
protein sequence?
Answer:
The Ensembl transcript or protein sequence is compared, using BLAST, against Entrez Gene databases. In the case of a nucleotide sequence, it's the Ensembl cDNA that is compared.
I replied:
If I look at the external references for ENSG00000196176 (Homo
sapiens), then I get 14 links to EntrezGene.
If I then go to Entrez Gene, then I only see for 8359 (HIST1H4A) the
same reference to Ensembl. The other 13 refer to a different Ensembl
identifier.
I know all these genes encode for the same protein, but I was assuming
the nucleotide sequence is different for all 14 and that therefore
ENSG00000196176 would only be linked to 8359 and not the 13 other
ones.
The answer I got:
The external references need not be perfect matches. As the HIST1H4 records show a high degree of sequence similarity, the 14 will match to the Ensembl record, however not with a 100% id. These are just "close matches". 8359 is the best match and listed first.
This was back in 2009, but it does explain some of the discrepancies.
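If you want to see how far a given mapping table is from one-to-one, you can group the ID pairs in both directions. Below is a minimal Python sketch, assuming the pairs are already loaded as (Entrez Gene ID, Ensembl gene ID) tuples. The function name `find_ambiguous` is my own, and only the 8359 / ENSG00000196176 pair comes from the example above; the other IDs are made-up placeholders.

```python
from collections import defaultdict


def find_ambiguous(pairs):
    """Return the IDs on each side that map to more than one partner."""
    entrez_to_ensembl = defaultdict(set)
    ensembl_to_entrez = defaultdict(set)
    for entrez_id, ensembl_id in pairs:
        entrez_to_ensembl[entrez_id].add(ensembl_id)
        ensembl_to_entrez[ensembl_id].add(entrez_id)
    # Keep only the IDs with more than one partner on the other side.
    multi_entrez = {k: sorted(v) for k, v in entrez_to_ensembl.items() if len(v) > 1}
    multi_ensembl = {k: sorted(v) for k, v in ensembl_to_entrez.items() if len(v) > 1}
    return multi_entrez, multi_ensembl


# Only the first pair is real; the FAKE entries are placeholders for illustration.
pairs = [
    ("8359", "ENSG00000196176"),
    ("FAKE_1", "ENSG00000196176"),
    ("FAKE_2", "ENSG_FAKE_A"),
    ("FAKE_2", "ENSG_FAKE_B"),
]
multi_entrez, multi_ensembl = find_ambiguous(pairs)
print(multi_entrez)   # Entrez Gene IDs linked to several Ensembl genes
print(multi_ensembl)  # Ensembl genes linked to several Entrez Gene IDs
```

Running the same check on the pairs from two different sources (for example the Entrez Gene links and a BioMart export) is a quick way to see where they disagree.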
On the website of Entrez Gene they state the following for the file gene2ensembl they provide on their FTP site:
This file reports matches between NCBI and Ensembl annotation
based on comparison of rna and protein features.
For all organisms, matches are collected as follows.
For a protein to be identified as a match between RefSeq and
Ensembl, there must be at least 80% overlap between the two.
Furthermore, splice site matches must meet certain conditions:
either 60% or more of the splice sites must match, or there may
be at most one splice site mismatch.
For rna features, the matching criteria are the same as for
proteins above. Furthermore, both the rna and the protein features
must meet these minimum matching criteria to be considered a good
match. In addition, only the best matches will be reported in this
file. Other matches that satisfied the matching criteria but were
not the best matches will not be reported in this file.
<-----------------------------------------------------------------------------------
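To make the gene2ensembl criteria quoted above a bit more concrete, here is how I read them, written as a small Python check. This is only my interpretation of the README text, not NCBI's actual code: the class `FeatureComparison`, its fields, and the function names are hypothetical, and the "only the best matches are reported" step is not modelled.

```python
from dataclasses import dataclass


@dataclass
class FeatureComparison:
    """Hypothetical summary of comparing one RefSeq feature with one Ensembl feature."""
    overlap_fraction: float     # 0.0 to 1.0
    splice_sites_total: int
    splice_sites_matching: int


def passes(cmp):
    """At least 80% overlap, and either >= 60% splice sites match or <= 1 mismatch."""
    if cmp.overlap_fraction < 0.80:
        return False
    mismatches = cmp.splice_sites_total - cmp.splice_sites_matching
    if mismatches <= 1:
        return True
    # More than one mismatch implies at least two splice sites, so no division by zero.
    return cmp.splice_sites_matching / cmp.splice_sites_total >= 0.60


def is_good_match(rna, protein):
    """Both the rna and the protein feature must meet the minimum criteria."""
    return passes(rna) and passes(protein)


rna = FeatureComparison(overlap_fraction=0.9, splice_sites_total=10, splice_sites_matching=7)
protein = FeatureComparison(overlap_fraction=0.85, splice_sites_total=10, splice_sites_matching=9)
print(is_good_match(rna, protein))
```

Note that these criteria are much stricter than the BLAST-based "close matches" Ensembl described, which is one reason the two resources can report different links for the same gene.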
Hope it helps,
Regards,
Miranda
Just to clarify: Entrez is not a gene database. It's the name of the NCBI infrastructure which provides access to all of the NCBI databases. One of those is the Gene database, so you would say "Entrez Gene".
Done, except for "bijective" => what's the typo there?
This is a very good question. Could you help clarify it a bit by fixing some of the typos: "bijective", "everye", "Ensemble"?
@Untom - I stand corrected, I never had heard this word before. I assumed incorrectly it was a typo like the others. My apologies.