Question: Gencode V15 Exons < 3Bp
PoGibas (Vilnius) wrote, 6.2 years ago:

After checking exon lengths in gencode.v15.annotation, I noticed that some exons are only 1bp or 2bp long.

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz" | 
     gunzip -c | 
     awk '($3=="exon" && $5-$4+1 < 3) {print}'

That's strange, as some exons (protein coding or lncRNA) are only 1bp or 2bp long. Is it bioinformatics or (probably not) biology? Has anyone ever noticed something like that in a different annotation?
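To see which kinds of transcripts the short exons belong to, one could tally them by type. The following is only a sketch: the helper name `short_exons_by_type` is made up for illustration, and it assumes the attribute column (column 9) carries a `transcript_type "..."` key, as Gencode GTF files normally do.

```shell
# Hypothetical helper: count exons shorter than 3 bp per transcript_type.
# Assumes column 9 contains an attribute of the form: transcript_type "VALUE";
short_exons_by_type() {
    awk -F'\t' '
        $3 == "exon" && ($5 - $4 + 1) < 3 {
            type = "unknown"
            if (match($9, /transcript_type "[^"]+"/)) {
                # strip the leading key and quotes, keeping only VALUE
                type = substr($9, RSTART + 17, RLENGTH - 18)
            }
            count[type]++
        }
        END { for (t in count) print count[t] "\t" t }
    ' | sort -k1,1nr -k2,2
}

# Example usage (the file name is a placeholder):
# gunzip -c gencode.v15.annotation.gtf.gz | short_exons_by_type
```

Records without a `transcript_type` attribute are counted under "unknown" rather than being dropped.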


The coordinates in a GTF are inclusive, so the length should be computed as $5 - $4 + 1. So the lengths are actually 1. Still pretty weird that you get an exon length of 1, though.
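As a small worked example of the inclusive-coordinate arithmetic, here is a sketch using made-up GTF-like records (the coordinates and the helper name `feature_length` are illustrative and not taken from the Gencode file):

```shell
# Hypothetical helper: print the inclusive length (end - start + 1)
# of each tab-separated GTF record read from stdin.
feature_length() {
    awk -F'\t' '{ print $5 - $4 + 1 }'
}

# Made-up records: start == end is a 1 bp feature, 100..101 is 2 bp.
printf 'chr1\ttest\texon\t100\t100\t.\t+\t.\tgene_id "g1";\n' | feature_length
printf 'chr1\ttest\texon\t100\t101\t.\t+\t.\tgene_id "g2";\n' | feature_length
```

A record whose start and end are equal therefore covers exactly one base, whereas a plain `$5 - $4` would report 0.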

(comment by Damian Kao, 6.2 years ago)

Some non-coding RNAs shared in protein coding genes are marked with 0 or 1 length.

(comment by JC, 6.2 years ago)

Thanks, fixed it.

(comment by PoGibas, 5.3 years ago)
PoGibas (Vilnius) wrote, 5.3 years ago:

I contacted the Gencode staff about this issue in February. They replied and hoped that the problem would be fixed by Gencode v16.

Apparently there was a bug in one of their scripts.
"... there should be no exons in Gencode <3bp. Alignments of <3bp can not be trusted, even when spanning known splice junctions, or confirming known UTRs/retained introns".

The current Gencode annotation (v18) still has this problem (I don't know why it hasn't been fixed yet).
I would suggest filtering those exons out.
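One way to do that filtering is sketched below. The helper name `drop_short_exons` and the file names are placeholders, and the 3 bp cutoff simply follows the threshold quoted above.

```shell
# Hypothetical helper: drop exon records shorter than min_len bp (default 3),
# passing comment lines and all other feature types through unchanged.
drop_short_exons() {
    awk -F'\t' -v min_len="${1:-3}" '
        /^#/ { print; next }                               # keep header lines
        $3 == "exon" && ($5 - $4 + 1) < min_len { next }   # skip short exons
        { print }
    '
}

# Example usage (file names are placeholders):
# gunzip -c gencode.v18.annotation.gtf.gz | drop_short_exons 3 > filtered.gtf
```

Note that this only removes the `exon` lines themselves. Any `CDS` or `UTR` records covering the same bases are left in place, so whether it would be better to drop the affected transcripts entirely depends on the downstream analysis.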

Emily_Ensembl (EMBL-EBI) wrote, 5.2 years ago:

Here's what Laurens says now:

I had a look at a couple of examples:

  • OTTHUMT00000321563 has a 2bp first (coding) exon because it is 5' incomplete and those two bases align to a reference exon, though arguably they could also align to the exon before that and to other, more upstream exons. I have now deleted that exon.
  • OTTHUMT00000470867 doesn't have a 1bp exon in our internal database any more; it's 227 bp long now. So that should be in a future Ensembl update.

I will go through the short-exon list from Gencode v18 and fix where necessary.
