Confused by Ensembl exon IDs: how should I understand them?
8.9 years ago
Yu ▴ 140

Hi all,

Recently I have been working with Ensembl GTF annotation files, trying to pick out the exons I need.

I am confused by the Ensembl exon IDs. For example, the three exons below all belong to gene ENSG00000000003 and have the same start and end coordinates.

chrX    protein_coding  exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000373020"; exon_number "2"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-001"; exon_id "ENSE00003662440";
chrX    processed_transcript    exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000496771"; exon_number "2"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-003"; exon_id "ENSE00003512331";
chrX    processed_transcript    exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000494424"; exon_number "3"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-002"; exon_id "ENSE00003512331";

My questions:

  1. Why are the first exon (ENSE00003662440) and the last two exons (ENSE00003512331) annotated with different exon IDs?
  2. Could anybody explain how exon IDs are assigned? (I could not find any documentation on the Ensembl site about exon annotation.)

Thanks

Exon Ensembl • 6.4k views

I'm also a bit confused about these GTF files. What does "exon version" mean? And what does the first "1" on every line mean?

1   havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    13221   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12010   12057   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001948541"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1   havana  exon    12179   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001671638"; exon_version "2"; tag "basic"; transcript_support_level "NA";
1   havana  exon    12613   12697   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001758273"; exon_version "2"; tag "basic"; transcript_support_level "NA";

Thanks in advance!


The "1" at the start of every line is the name of the chromosome (chromosome 1). exon_version is the stable identifier version for this exon.

You can find details of the GTF format at: ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/README
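
To make the column layout concrete, here is a minimal sketch of splitting one GTF line into its nine tab-separated fields and turning the last field into a dictionary. The function name `parse_gtf_line` and the column labels are my own choices for illustration, not part of any Ensembl tool, and the example line is a shortened version of the first DDX11L1 line above.

```python
import re

# Labels for the first eight GTF columns (names chosen for this sketch).
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Note: a key that occurs more than once (e.g. tag) keeps only its last value.
    record["attributes"] = dict(ATTR_RE.findall(fields[8]))
    return record


# Shortened example line; real files have many more attributes.
example = "\t".join([
    "1", "havana", "exon", "11869", "12227", ".", "+", ".",
    'gene_id "ENSG00000223972"; exon_id "ENSE00002234944"; exon_version "1";',
])
rec = parse_gtf_line(example)
print(rec["seqname"], rec["attributes"]["exon_id"], rec["attributes"]["exon_version"])
```

Here `rec["seqname"]` is the chromosome name from the first column, and `exon_version` comes out of the attribute column alongside `exon_id`.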

8.9 years ago
Neilfws 49k

I'd guess part of the explanation is that the transcripts come from different sources and, despite what your GTF file states, have different biotypes.

Take a look at the region here. Transcripts TSPAN6-002 and TSPAN6-003 are Havana transcripts of type "processed transcript". TSPAN6-001 is an Ensembl/Havana merge transcript of type "known protein coding". So the 2 former exons are considered "the same exon"; the latter exon has the same coordinates but a different source so is considered a different exon.

There is some (not detailed) information about the annotation process here. Also note that your data appear to come from genome build GRCh37 and things are somewhat different in the latest build. There are now 5 transcripts of 3 types and consequently 3 exon IDs.


Thanks a lot! I think it is better to ignore the Exon ID when trying to find the same exon in multiple transcripts.
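
A minimal sketch of that idea, assuming the exon rows have already been pulled out of the GTF file: key each exon on (chromosome, start, end, strand) instead of on exon_id. The tuple layout and the function name `group_by_coordinates` are just for illustration; the values are the three TSPAN6 lines from the question.

```python
from collections import defaultdict

# (chrom, start, end, strand, transcript_id, exon_id) from the question's three lines
exons = [
    ("chrX", 99890555, 99890743, "-", "ENST00000373020", "ENSE00003662440"),
    ("chrX", 99890555, 99890743, "-", "ENST00000496771", "ENSE00003512331"),
    ("chrX", 99890555, 99890743, "-", "ENST00000494424", "ENSE00003512331"),
]


def group_by_coordinates(exon_rows):
    """Group exon rows by genomic coordinates, ignoring exon_id (hypothetical helper)."""
    groups = defaultdict(lambda: {"transcripts": set(), "exon_ids": set()})
    for chrom, start, end, strand, transcript_id, exon_id in exon_rows:
        key = (chrom, start, end, strand)
        groups[key]["transcripts"].add(transcript_id)
        groups[key]["exon_ids"].add(exon_id)
    return dict(groups)


shared = group_by_coordinates(exons)
for key, info in shared.items():
    print(key, sorted(info["transcripts"]), sorted(info["exon_ids"]))
```

With these three rows there is a single coordinate group containing all three transcripts, even though two different exon IDs are attached to it.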
