Question: Hg37: Bad Annotations From Ensembl?
2
7.8 years ago by Pablo (1.9k), Canada
Pablo wrote:

I've recently noticed weird entries in hg37.63 from Ensembl. As an example, here is the first exon of transcript ENST00000310701:

1       protein_coding  exon    148025761       148025848       .       -       .        gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001";
1       protein_coding  CDS     148025761       148025848       .       -       2        gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001"; protein_id "ENSP00000309907";

This seems to be a protein coding transcript. Exon and CDS start and end at the same position, which means there is no UTR.

Here is the weird part: if you query Ensembl for variants at the start position and at one base before it, you get:

Uploaded Variation  Location    Allele  Gene    Feature Feature type    Consequence Position in cDNA    Position in CDS Position in protein Amino acid change   Codon change    Co-located Variation    Extra
1_148025849_A   1:148025849 A   ENSG00000122497 ENST00000310701 Transcript  UPSTREAM    -   -   -   -   -   -   -
1_148025848_A   1:148025848 A   ENSG00000122497 ENST00000310701 Transcript  SYNONYMOUS_CODING   1   2   1   X   nAa/nTa -   -

So, the start base (148025848) is the SECOND base of the first codon. If you take a detailed look at the GTF definition, you'll notice a '2' in the 'frame' column.
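To make the relationship between the 'frame' column and the codon position concrete, here is a minimal Python sketch. It assumes the GTF 2.2 reading of 'frame' (the number of bases before the first whole codon at the feature's 5' end) and tab-separated columns; the function name `codon_position_of_first_base` is made up for illustration and is not part of any Ensembl tool.

```python
def codon_position_of_first_base(frame: int) -> int:
    """Return the 1-based codon position (1, 2 or 3) of a CDS feature's 5'-most base.

    Assumes the GTF 2.2 meaning of 'frame': the number of bases that precede
    the first whole codon in the feature (0, 1 or 2).
    """
    if frame not in (0, 1, 2):
        raise ValueError(f"unexpected frame value: {frame!r}")
    # frame 0 -> position 1, frame 1 -> position 3, frame 2 -> position 2
    return (3 - frame) % 3 + 1


# The CDS line from above, written with tab separators (attributes shortened).
gtf_line = (
    "1\tprotein_coding\tCDS\t148025761\t148025848\t.\t-\t2\t"
    'gene_id "ENSG00000122497"; transcript_id "ENST00000310701";'
)

fields = gtf_line.split("\t")
start, end, strand, frame = int(fields[3]), int(fields[4]), fields[6], int(fields[7])

# On the minus strand the 5'-most base of the feature is its 'end' coordinate.
five_prime_base = end if strand == "-" else start

print(five_prime_base, codon_position_of_first_base(frame))  # 148025848 2
```

Under that reading, a frame of 2 on a minus-strand CDS ending at 148025848 puts that base at the second position of a codon, which is the same position the variant query reports.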

The question is: Considering that the transcript has no UTR, is there a valid reason for the first base of the first exon to be the second base of the CDS?

I guess an alternative question is: am I interpreting this data incorrectly, or does this look like a bug?

genome snp • 2.3k views
modified 7.8 years ago by Bert Overduin (3.6k) • written 7.8 years ago by Pablo (1.9k)

According to my interpretation of the GTF 2.2 specification (http://mblab.wustl.edu/GTF22.html), the "frame" calculation on these transcripts seems to be incorrect.

written 7.8 years ago by Pablo (1.9k)

It looks like there are around 5000 transcripts in hg37.63 that may have a similar problem.

written 7.8 years ago by Pablo (1.9k)
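One way to arrive at a count like the "around 5000 transcripts" mentioned above is to scan the GTF for transcripts whose 5'-most CDS has a non-zero frame. The Python sketch below is only an illustration of that idea, not the method used in this thread: the function name `five_prime_partial_transcripts` is hypothetical, it assumes tab-separated GTF lines with `key "value";` attributes, and the second example record (ENST_EXAMPLE) is invented purely as a contrast case.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def five_prime_partial_transcripts(gtf_lines):
    """Return {transcript_id: frame} for transcripts whose 5'-most CDS has frame != 0."""
    first_cds = {}  # transcript_id -> (5'-most coordinate seen so far, its frame)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS" or fields[7] not in ("0", "1", "2"):
            continue
        transcript_id = dict(ATTR_RE.findall(fields[8])).get("transcript_id")
        if transcript_id is None:
            continue
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        # The 5' end of a feature is 'end' on the minus strand, 'start' otherwise.
        coord = end if strand == "-" else start
        previous = first_cds.get(transcript_id)
        if previous is None:
            more_five_prime = True
        elif strand == "-":
            more_five_prime = coord > previous[0]
        else:
            more_five_prime = coord < previous[0]
        if more_five_prime:
            first_cds[transcript_id] = (coord, int(fields[7]))
    return {tid: frame for tid, (_, frame) in first_cds.items() if frame != 0}


example_lines = [
    # CDS record quoted in the question (attributes shortened).
    '1\tprotein_coding\tCDS\t148025761\t148025848\t.\t-\t2\t'
    'gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1";',
    # Invented record for contrast: a first CDS that starts on a whole codon.
    '1\tprotein_coding\tCDS\t1000\t1099\t.\t+\t0\t'
    'gene_id "ENSG_EXAMPLE"; transcript_id "ENST_EXAMPLE"; exon_number "1";',
]

print(five_prime_partial_transcripts(example_lines))  # {'ENST00000310701': 2}
```

On a real annotation file the same function could be fed an open file handle instead of a list. A non-zero frame on the first CDS only flags a candidate; whether it reflects an incomplete 5' end or an actual error still has to be checked per transcript.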
4
7.8 years ago by Bert Overduin (3.6k), Edinburgh Genomics, The University of Edinburgh
Bert Overduin wrote:

Pablo,

If you have a look at http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?g=ENSG00000122497;r=1:148003642-148025848;t=ENST00000310701, you can see that this transcript (annotated by the Havana team) is incomplete at the 5' end and starts at the second base of a codon. So, that should explain your observation.

Hope this helps.

By the way, it's either GRCh37 or hg19, but not hg37 .... ;)

Cheers, Bert

written 7.8 years ago by Bert Overduin (3.6k)

Unfortunately, this causes a lot of trouble for people parsing and trying to understand these incomplete annotations. I always alienate people by saying hg37 instead of GRCh37; I guess I'm too lazy to write 2 extra letters :-)

written 7.8 years ago by Pablo (1.9k)

1
7.8 years ago by Sander Timmer (700), United Kingdom
Sander Timmer wrote:

This doesn't answer your question, but I have one piece of advice for you: Ensembl has a dedicated Helpdesk team that you can email with questions or possible bugs. Just tell them what you did and what kind of result you expected.

You can contact them at http://www.ensembl.org/info/about/contact/index.html

written 7.8 years ago by Sander Timmer (700)