Question

Confused by Ensembl exon IDs: how should I interpret them?


8.9 years ago

Yu ▴ 140

Hi, all

Recently, I have been working with Ensembl GTF annotation files, trying to pick out the exons I need.

I am confused by the Ensembl exon IDs. For example, the three exons below all belong to gene ENSG00000000003 and have the same start and end coordinates.

chrX    protein_coding  exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000373020"; exon_number "2"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-001"; exon_id "ENSE00003662440";
chrX    processed_transcript    exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000496771"; exon_number "2"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-003"; exon_id "ENSE00003512331";
chrX    processed_transcript    exon    99890555        99890743        .       -       .       gene_id "ENSG00000000003"; transcript_id "ENST00000494424"; exon_number "3"; gene_name "TSPAN6"; gene_biotype "protein_coding"; transcript_name "TSPAN6-002"; exon_id "ENSE00003512331";

My questions:

Why are the first exon (ENSE00003662440) and the last two exons (ENSE00003512331) annotated with different exon IDs?
Could anybody explain how exon IDs are assigned? (I could not find any documentation on the Ensembl site about exon annotation.)

Thanks


I'm also a bit confused about these GTF files. What does "exon_version" mean? And what does the first "1" at the start of every line mean?

1   havana  exon    11869   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12613   12721   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    13221   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
1   havana  exon    12010   12057   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001948541"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1   havana  exon    12179   12227   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001671638"; exon_version "2"; tag "basic"; transcript_support_level "NA";
1   havana  exon    12613   12697   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-001"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; havana_transcript "OTTHUMT00000002844"; havana_transcript_version "2"; exon_id "ENSE00001758273"; exon_version "2"; tag "basic"; transcript_support_level "NA";

Thanks in advance!

Comment written 6.9 years ago by gandrescabrera ▴ 80 (updated 15 months ago by Ram 43k)

The first "1" on every line is the chromosome name (chromosome 1). exon_version is the version of the stable identifier for this exon.

You can find details of the GTF format in the README: ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/README
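To make the column layout concrete, here is a minimal Python sketch that splits one GTF line into its eight fixed columns plus the attribute field. The column names in `GTF_COLUMNS` and the helper `parse_gtf_line` are my own labels for illustration, not anything shipped by Ensembl, and the example line is an abridged, re-tabbed copy of the first DDX11L1 exon quoted above.

```python
# Hypothetical column labels for the eight fixed, tab-separated GTF fields.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]


def parse_gtf_line(line):
    """Split one GTF line into fixed columns and a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    attributes = {}
    # The ninth column is a list of `key "value";` pairs.
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Note: a repeated key (e.g. several `tag` entries) keeps only the last value.
        attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record


# Abridged copy of the first exon line from the comment above.
example = "\t".join([
    "1", "havana", "exon", "11869", "12227", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_version "5"; '
    'transcript_id "ENST00000456328"; exon_number "1"; '
    'exon_id "ENSE00002234944"; exon_version "1";',
])

record = parse_gtf_line(example)
print(record["seqname"],
      record["attributes"]["exon_id"],
      record["attributes"]["exon_version"])
```

The leading "1" ends up in `seqname`, and `exon_version` is just one more key in the attribute dictionary.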

Reply written 6.9 years ago by Yu ▴ 140

Ram · Accepted Answer · 2015-05-27

I'd guess part of the explanation is that the transcripts come from different sources and, despite what your GTF file states, have different biotypes.

Take a look at the region here. Transcripts TSPAN6-002 and TSPAN6-003 are Havana transcripts of type "processed transcript", whereas TSPAN6-001 is an Ensembl/Havana merged transcript of type "known protein coding". So the exons in TSPAN6-002 and TSPAN6-003 are considered "the same exon"; the exon in TSPAN6-001 has the same coordinates but a different source, so it is considered a different exon.
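If you want to find every case like this in your own file, a rough Python sketch along the following lines groups exon records by coordinates and reports which exon IDs, transcripts, and second-column values share each interval. The function name `exon_ids_by_coordinates` is made up for this example, and the three input lines are abridged, re-tabbed copies of the ones in the question.

```python
import re
from collections import defaultdict

# Matches `key "value"` pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def exon_ids_by_coordinates(gtf_lines):
    """Map (chrom, start, end, strand) -> {exon_id: [(transcript_id, column 2), ...]}."""
    groups = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are of interest here
        attrs = dict(ATTR_RE.findall(fields[8]))
        key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        groups[key][attrs.get("exon_id")].append(
            (attrs.get("transcript_id"), fields[1])
        )
    return {key: dict(by_id) for key, by_id in groups.items()}


# Abridged versions of the three TSPAN6 exon lines from the question.
rows = [
    ("protein_coding", "ENST00000373020", "ENSE00003662440"),
    ("processed_transcript", "ENST00000496771", "ENSE00003512331"),
    ("processed_transcript", "ENST00000494424", "ENSE00003512331"),
]
lines = [
    "\t".join([
        "chrX", column2, "exon", "99890555", "99890743", ".", "-", ".",
        f'gene_id "ENSG00000000003"; transcript_id "{tid}"; exon_id "{eid}";',
    ])
    for column2, tid, eid in rows
]

result = exon_ids_by_coordinates(lines)
for coords, by_id in result.items():
    if len(by_id) > 1:  # same interval, more than one exon ID
        print(coords)
        for exon_id, members in by_id.items():
            print("  ", exon_id, members)
```

On a real file you would pass an open file handle instead of `lines`. This only lists the intervals that carry more than one exon ID; it does not tell you why Ensembl assigned them separately.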

There is some (not detailed) information about the annotation process here. Also note that your data appear to come from genome build GRCh37, and things are somewhat different in the latest build: there are now 5 transcripts of 3 types and, consequently, 3 exon IDs.