Question

What is the meaning of zero-length exons in GENCODE?

7.7 years ago

Charles Plessy ★ 2.9k

I found a few zero-length exons in GENCODE... Does anybody know whether that is a bug, or whether it has a meaning?

zcat gencode.v25.annotation.gtf.gz | awk '$3 == "exon" && ($5 - $4) == 0' | cut -f1,2,3,4,5,7,9 | cut -c1-65
chr2    ENSEMBL exon    96695297    96695297    +   gene_id "ENSG00000249715.9"
chr2    ENSEMBL exon    166473892   166473892   -   gene_id "ENSG00000136546.
chr4    ENSEMBL exon    1730388 1730388 +   gene_id "ENSG00000013810.18";
chr4    ENSEMBL exon    169663114   169663114   +   gene_id "ENSG00000109572.
chr5    ENSEMBL exon    796064  796064  -   gene_id "ENSG00000188818.12"; t
chr5    HAVANA  exon    88804598    88804598    -   gene_id "ENSG00000081189.14"
chr11   ENSEMBL exon    71580167    71580167    -   gene_id "ENSG00000204571.5
chr11   ENSEMBL exon    76191778    76191778    -   gene_id "ENSG00000085741.1
chr11   ENSEMBL exon    101050949   101050949   -   gene_id "ENSG00000082175
chr14   ENSEMBL exon    24632719    24632719    -   gene_id "ENSG00000100453.1
chr16   ENSEMBL exon    89553267    89553267    +   gene_id "ENSG00000197912.1
chr17   ENSEMBL exon    41624191    41624191    -   gene_id "ENSG00000128422.1
chr17   ENSEMBL exon    43883386    43883386    -   gene_id "ENSG00000108852.1
chr18   ENSEMBL exon    9887458 9887458 +   gene_id "ENSG00000168454.11"
chr19   ENSEMBL exon    49836839    49836839    +   gene_id "ENSG00000104973.1

GENCODE exon • 2.7k views

ADD COMMENT • link updated 7.6 years ago by Mark Thomas ▴ 80 • written 7.7 years ago by Charles Plessy ★ 2.9k

score 4 · Accepted Answer · 2016-09-01

7.7 years ago

Devon Ryan 104k

Remember that GTF uses 1-based, inclusive coordinates, so an exon whose start equals its end is 1 base long. Those are 1-base microexons, not zero-length exons.
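To make the arithmetic concrete, here is a minimal sketch of the length calculation under the GTF convention. The two exon lines are made up for illustration (the `chrA`/`DEMO` names and the coordinates are not from GENCODE); with a real file you would pipe `zcat gencode.v25.annotation.gtf.gz` into the same awk program instead.

```shell
# Two made-up, tab-separated GTF-style exon lines (illustrative only).
gtf_sample() {
  printf 'chrA\tDEMO\texon\t100\t100\t.\t+\t.\tgene_id "G1";\n'
  printf 'chrA\tDEMO\texon\t200\t249\t.\t+\t.\tgene_id "G1";\n'
}

# GTF start and end are 1-based and inclusive, so length = end - start + 1.
exon_lengths=$(gtf_sample | awk -F'\t' '$3 == "exon" { print $4, $5, $5 - $4 + 1 }')
echo "$exon_lengths"
```

The first line has start equal to end and is reported with length 1, which is why the `$5 - $4 == 0` filter in the question selects 1 bp exons rather than empty ones.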

ADD COMMENT • link 7.7 years ago by Devon Ryan 104k

score 4 · Accepted Answer · 2016-09-01

7.7 years ago

Denise CS ★ 5.2k

This is not a bug. Ensembl's mode of annotation may allow for exons of any length if there are internal stops in the CDS. If that's the case, the stops will be replaced with introns. You may want to send these examples to the Ensembl helpdesk if you want them to take a second look at them.

ADD COMMENT • link 7.7 years ago by Denise CS ★ 5.2k

Thanks Denise and Devon. I am now reading about microexons, which I admit I have overlooked so far. Perhaps I will contact the helpdesk about cases where the first exon is tiny, because at the moment I do not see how splicing can function in that case. See this example (the only one) where the first exon has a length of 1:

$ zcat gencode.v25.annotation.gtf.gz | grep 'exon_number 1;' | awk '$3 == "exon" && $5 - $4 == 0'
chr5    HAVANA  exon    88804598    88804598    .   -   .   gene_id "ENSG00000081189.14"; transcript_id "ENST00000637754.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "MEF2C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "MEF2C-056"; exon_number 1; exon_id "ENSE00003800555.1"; level 2; tag "RNA_Seq_supported_only"; havana_gene "OTTHUMG00000162634.11"; havana_transcript "OTTHUMT00000490941.1";

For the record, here is the whole transcript:

$ zcat gencode.v25.annotation.gtf.gz | grep 'ENST00000637754.1'
chr5    HAVANA  transcript  88771973    88804598    .   -   .   gene_id "ENSG00000081189.14"; transcript_id "ENST00000637754.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "MEF2C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "MEF2C-056"; level 2; tag "RNA_Seq_supported_only"; havana_gene "OTTHUMG00000162634.11"; havana_transcript "OTTHUMT00000490941.1";
chr5    HAVANA  exon    88804598    88804598    .   -   .   gene_id "ENSG00000081189.14"; transcript_id "ENST00000637754.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "MEF2C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "MEF2C-056"; exon_number 1; exon_id "ENSE00003800555.1"; level 2; tag "RNA_Seq_supported_only"; havana_gene "OTTHUMG00000162634.11"; havana_transcript "OTTHUMT00000490941.1";
chr5    HAVANA  exon    88771973    88772000    .   -   .   gene_id "ENSG00000081189.14"; transcript_id "ENST00000637754.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "MEF2C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "MEF2C-056"; exon_number 2; exon_id "ENSE00003796888.1"; level 2; tag "RNA_Seq_supported_only"; havana_gene "OTTHUMG00000162634.11"; havana_transcript "OTTHUMT00000490941.1";

Edit on January 20th, 2017: corrected a small bug in the one-liner, by adding ";" after "exon_number 1". Results unchanged.
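As an aside on why the trailing ";" matters: without it, the pattern also matches exon_number 10, 11, 12 and so on. The sketch below shows this on three made-up attribute strings (the exon IDs are placeholders, not real GENCODE identifiers).

```shell
# Three made-up attribute strings in the GENCODE v25 style (unquoted exon_number).
attrs() {
  printf 'exon_number 1; exon_id "E1";\n'
  printf 'exon_number 10; exon_id "E2";\n'
  printf 'exon_number 12; exon_id "E3";\n'
}

# Without the semicolon the pattern is a prefix of "exon_number 10" and "exon_number 12".
loose=$(attrs | grep -c 'exon_number 1')
# With the semicolon only the first exon matches.
strict=$(attrs | grep -c 'exon_number 1;')
echo "loose=$loose strict=$strict"
```

On this toy input the loose pattern matches all three lines and the strict one matches only the first.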

ADD REPLY • link 7.3 years ago by Charles Plessy ★ 2.9k

ENST00000637754 is a HAVANA transcript based on RNA-seq data only. For this case, it may be quicker to contact HAVANA directly. If there is a mistake with that transcript (and its tiny first exon), they will correct it, and the revised annotation will be available in Ensembl once Ensembl's annotation is merged with HAVANA's. I'd guess the RNA-seq reads they mapped to the genome did not allow them to extend the 5' end of that model.

ADD REPLY • link 7.6 years ago by Denise CS ★ 5.2k

At least some of these make more sense if you look at them in the context of the other annotated isoforms. The MEF2C isoform is a processed transcript; in the protein-coding isoforms, that microexon is much larger. I bet that in most of these cases what you're seeing are truncated transcripts that are annotated as "processed transcript" (I still have no real clue what that means).
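One way to check that guess would be to tally the transcript_type of every 1 bp exon. The following is only a sketch on made-up lines (gene IDs, coordinates and the resulting counts are illustrative and say nothing about the real v25 numbers); on the real annotation, replace the `sample_exons` function with `zcat gencode.v25.annotation.gtf.gz`.

```shell
# Made-up exon lines; only the layout mimics GENCODE.
sample_exons() {
  printf 'chrA\tDEMO\texon\t100\t100\t.\t+\t.\tgene_id "G1"; transcript_type "protein_coding";\n'
  printf 'chrA\tDEMO\texon\t300\t300\t.\t-\t.\tgene_id "G2"; transcript_type "processed_transcript";\n'
  printf 'chrA\tDEMO\texon\t500\t500\t.\t-\t.\tgene_id "G3"; transcript_type "protein_coding";\n'
  printf 'chrA\tDEMO\texon\t700\t750\t.\t+\t.\tgene_id "G3"; transcript_type "protein_coding";\n'
}

# Keep exons with start == end, pull out the transcript_type value, count per type.
type_counts=$(sample_exons | awk -F'\t' '
  $3 == "exon" && $5 == $4 {
    if (match($9, /transcript_type "[^"]+"/)) {
      # Strip the key and the quotes, keeping only the value.
      value = substr($9, RSTART + 17, RLENGTH - 18)
      count[value]++
    }
  }
  END { for (t in count) print count[t], t }' | sort -k2)
echo "$type_counts"
```

The 51-base exon in the last line is ignored, so only the three 1 bp exons are counted.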

ADD REPLY • link 7.6 years ago by Devon Ryan 104k

Processed transcript is the name HAVANA gives to a transcript that is not coding. Check their help on the VEGA site. A processed transcript can be a lncRNA, an ncRNA or anything else (the unclassified). If further transcriptional evidence in later rounds of annotation allows that model to be extended, it may be possible to find an ORF and reclassify the processed transcript as coding.

ADD REPLY • link 7.6 years ago by Denise CS ★ 5.2k

Makes sense, thanks!

ADD REPLY • link 7.6 years ago by Devon Ryan 104k

score 3 · Accepted Answer · 2016-09-08

It looks as though these zero-length exons, which are actually 1 bp in length [Thanks, Devon], are the result of poor alignments. With the exception of MEF2C (ENSG00000081189.14), they are all internal exons from the Ensembl genebuild pipeline. As Denise has mentioned, the pipeline will tolerate exons of any length, as micro-introns may be introduced to maintain the overall CDS when the alignment is poor. This is apparent in the ZDHHC11-201 transcript (ENSG00000188818; ENST00000424784.3), which contains a 1 bp exon together with multiple 1-2 bp introns. The 1 bp micro-exons identified by Charles are therefore most likely artifacts, which will need to be reviewed by HAVANA and Ensembl for the next GENCODE release.
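For anyone who wants to look for such micro-introns themselves, here is a rough sketch that derives intron coordinates and lengths from the exons of a single transcript. The three exon lines are invented to mimic the pattern described above (they are not the real ZDHHC11-201 coordinates); on real data you would first select one transcript, for example with `grep 'ENST00000424784.3'`, and keep only its exon lines.

```shell
# Invented exons of one hypothetical transcript "T1" (tab-separated, GTF-style).
transcript_exons() {
  printf 'chrA\tDEMO\texon\t100\t180\t.\t+\t.\ttranscript_id "T1";\n'
  printf 'chrA\tDEMO\texon\t183\t183\t.\t+\t.\ttranscript_id "T1";\n'
  printf 'chrA\tDEMO\texon\t185\t260\t.\t+\t.\ttranscript_id "T1";\n'
}

# Sort exons by start, then report each gap between consecutive exons:
# intron start = previous end + 1, intron end = next start - 1,
# intron length = next start - previous end - 1.
intron_lengths=$(transcript_exons | sort -t "$(printf '\t')" -k4,4n |
  awk -F'\t' 'NR > 1 { print prev_end + 1, $4 - 1, $4 - prev_end - 1 } { prev_end = $5 }')
echo "$intron_lengths"
```

Sorting by start works for both strands, since only the genomic gaps between neighbouring exons matter here.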

The MEF2C example is a truncated transcript [Thanks again, Devon], which was actually intended for an internal study and was accidentally released in GENCODE v25. It has now been fully annotated, so the updated transcript will be available in a subsequent release. Denise's reply about the processed_transcript biotype is correct; the biotype was originally used to indicate transcripts that had been 'processed' by the cellular machinery (e.g. splicing, polyadenylation).

While these micro-exons may be artifactual, there are other well-documented examples of alternatively spliced micro-exons, which are often strongly conserved [PMID:25524026, PMID:25525873]. On behalf of GENCODE, we welcome any comments or questions, which can be sent to either Ensembl or HAVANA.