Explanation of ENSEMBL GTF features
5
2
4.6 years ago
Benn 8.2k

Hi Guys,

I am trying to find more information about the features in the ENSEMBL GTF file, but I don't know where to find it. I am using the hg38 GTF file from ENSEMBL, and I am interested in column 3 (feature). More specifically, I would like to know the exact definitions of transcript and gene. Does the transcript include the UTRs? Does the gene include the UTRs? Introns?

It seems that there are 9 possible features:

awk -F'\t' '!/^#/ {print $3}' Homo_sapiens.GRCh38.87.gtf | sort -u

CDS
exon
five_prime_utr
gene
Selenocysteine
start_codon
stop_codon
three_prime_utr
transcript
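
For anyone who prefers Python to awk, here is a minimal sketch of the same tally with counts per feature type. The function name `count_feature_types` and the toy records are made up for illustration; only the column layout (feature type in column 3, `#` header lines) comes from the GTF format.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count the values in column 3 of a GTF stream, skipping '#' header lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Invented toy records for illustration; not taken from the Ensembl file.
toy_gtf = io.StringIO(
    "#!genome-build toy\n"
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    '1\ttoy\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
)

feature_counts = count_feature_types(toy_gtf)
for feature, n in sorted(feature_counts.items()):
    print(feature, n)
```

On the real file, the same function should work with `open("Homo_sapiens.GRCh38.87.gtf")` in place of the `io.StringIO` object.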


As I said, I am especially interested in the difference between gene and transcript. If someone could give me the definition or direct me to where it is documented, I would really appreciate it. Thanks.

ensembl annotation gtf • 4.1k views
1

Thanks guys for your input. Maybe I should have said that I have a PhD in genetics/genomics and have worked as a bioinformatician for years already. Judging by the answers, that was probably not clear.

This wasn't a newbie question about how basic genetics works; rather, I couldn't find the definitions/criteria used for making the GTF annotation file. Maybe because some of the genes are annotated manually, I don't know.

Thanks anyway!

0

Hi b.nota,

This is an old thread, but I'm wondering whether you ever found a good reference that answers your question. I'm also not a total newbie, but I could use some help with the annotation.

Thank you.

0

Nothing more than the answers herein, I'm afraid.

1
4.6 years ago

These terms are defined in the Sequence Ontology:

http://www.sequenceontology.org/browser/obob.cgi

0

Thanks for your help. If I look at the definition of transcript:

An RNA synthesized on a DNA or RNA template by an RNA polymerase.

Duh! But does that include UTR regions? Is the transcript before or after intron splicing?

0

The way you solve this is that you go and look up each term: UTR, intron etc.

For each definition, there will be a graphical display that clarifies the hierarchy - what is part of what. For example, on the image you can see clearly that the UTR is part of the transcript.

1
4.6 years ago
mastal511 ★ 2.1k

UTR stands for UnTranslated Region, so the UTRs should be included in both the gene and transcript. Introns should not be included in the transcript. Check the coordinates for a few of the genes annotated in the gtf file.
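
One way to do that coordinate check is sketched below: for each transcript, compare the `transcript` feature's start and end with the outermost exon boundaries, and compare the span length with the summed exon length. This is only a sketch; the helper names (`get_attr`, `span_report`) and the toy records are invented, so the real file is what has to be inspected to settle the question.

```python
import re
from collections import defaultdict

# Invented toy records for illustration; not taken from the Ensembl file.
TOY_GTF = """\
1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";
1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";
1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";
1\ttoy\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";
"""


def get_attr(attributes, key):
    """Return the quoted value of one attribute from column 9, or None."""
    match = re.search(r'\b%s "([^"]+)"' % re.escape(key), attributes)
    return match.group(1) if match else None


def span_report(gtf_text):
    """Compare each transcript feature with the exons that share its transcript_id."""
    tx_span = {}
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        feature, start, end = f[2], int(f[3]), int(f[4])
        tid = get_attr(f[8], "transcript_id")
        if feature == "transcript":
            tx_span[tid] = (start, end)
        elif feature == "exon":
            exons[tid].append((start, end))

    report = {}
    for tid, (t_start, t_end) in tx_span.items():
        ex = exons[tid]
        if not ex:
            continue  # no exon records seen for this transcript
        outer = (min(s for s, _ in ex), max(e for _, e in ex))
        report[tid] = {
            "transcript_equals_exon_span": (t_start, t_end) == outer,
            "span_length": t_end - t_start + 1,
            "exon_bases": sum(e - s + 1 for s, e in ex),
        }
    return report


report = span_report(TOY_GTF)
print(report)
```

If `span_length` is larger than `exon_bases` for a multi-exon transcript, the `transcript` coordinates cover the introns as well, even though the spliced RNA does not contain them.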

0
4.6 years ago
Emily 23k
0
9 days ago

Sorry it took a while. By definition, mRNA is 5'-UTR + CDS + 3'-UTR. But in the context of genome coordinates (i.e. in a GTF file) things are less clear. 'transcript' in a GTF file is identical to the primary, unspliced RNA, coordinate-wise. In the EnsEMBL GTF it looks like it's as follows:

Internal, fully translated exons have CDS features with identical coordinates as 'their' exon. For the (partly) untranslated exons (i.e. first and last exon(s) of a transcript), the exons have no or shorter CDS features and instead have UTR features. In other words, the CDS feature in the GTF file means 'this segment is part of the CDS'. I.e. if you string together all the UTR and CDS features you get the mRNA sequence, and this is of course the same as concatenating all exons of a transcript. If you string together just the CDS features and translate them, you get the protein sequence.
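
The bookkeeping described above can be checked with a small script. The sketch below sums exon, CDS and UTR bases for one transcript and lists the exons whose coordinates are identical to a CDS segment. The function name `summarize` and the toy records are invented for illustration. Whether the stop codon is counted inside the CDS differs between annotation formats, so the sketch reports the unaccounted bases rather than assuming the difference is zero.

```python
from collections import defaultdict

# Invented toy records for illustration; not taken from the Ensembl file.
TOY_GTF = """\
1\ttoy\texon\t100\t300\t.\t+\t.\ttranscript_id "T1";
1\ttoy\texon\t400\t500\t.\t+\t.\ttranscript_id "T1";
1\ttoy\texon\t600\t900\t.\t+\t.\ttranscript_id "T1";
1\ttoy\tfive_prime_utr\t100\t150\t.\t+\t.\ttranscript_id "T1";
1\ttoy\tCDS\t151\t300\t.\t+\t0\ttranscript_id "T1";
1\ttoy\tCDS\t400\t500\t.\t+\t0\ttranscript_id "T1";
1\ttoy\tCDS\t600\t699\t.\t+\t1\ttranscript_id "T1";
1\ttoy\tthree_prime_utr\t700\t900\t.\t+\t.\ttranscript_id "T1";
"""


def summarize(gtf_text, transcript_id):
    """Sum exon, CDS and UTR bases for one transcript and compare them."""
    segs = defaultdict(list)
    wanted = f'transcript_id "{transcript_id}"'
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if wanted in f[8]:
            segs[f[2]].append((int(f[3]), int(f[4])))

    def bases(kind):
        # GTF coordinates are 1-based and inclusive, hence the +1.
        return sum(end - start + 1 for start, end in segs[kind])

    exon_bases = bases("exon")
    cds_bases = bases("CDS")
    utr_bases = bases("five_prime_utr") + bases("three_prime_utr")
    return {
        "exon_bases": exon_bases,
        "cds_bases": cds_bases,
        "utr_bases": utr_bases,
        "unaccounted": exon_bases - (cds_bases + utr_bases),
        "fully_coding_exons": [s for s in segs["exon"] if s in segs["CDS"]],
    }


result = summarize(TOY_GTF, "T1")
print(result)
```

In the toy data the middle exon (400-500) is fully coding, so it has a CDS segment with identical coordinates, while the first and last exons are split between a UTR segment and a shorter CDS segment.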

I do think this is the best way to 'encode' this information in the GTF, but it would be nice if this were somewhere in the EnsEMBL documentation. I guess the confusion arises from conflating the biological CDS with the GTF 'CDS feature', which generally is just a coordinate pair. Hope this clarifies things.

PS: the same confusion may occur for the biological UTR vs the 'UTR feature' in a GTF file, as there may be more than one consecutive completely untranslated exon, and all of them should have corresponding UTR features.

0

CDS and UTRs are what we call discontinuous features (i.e. a single feature that exists over multiple genomic locations).
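
In practice this means the lines of a discontinuous feature have to be regrouped before use. A minimal sketch, assuming the segments of one CDS share a `transcript_id` attribute and that minus-strand segments should be read from the highest coordinate down; the function name `cds_by_transcript` and the toy records are invented for illustration.

```python
import re
from collections import defaultdict

# Invented toy records for illustration; not taken from the Ensembl file.
TOY_GTF = """\
1\ttoy\tCDS\t150\t300\t.\t+\t0\ttranscript_id "T1";
1\ttoy\tCDS\t400\t500\t.\t+\t2\ttranscript_id "T1";
1\ttoy\tCDS\t200\t300\t.\t-\t1\ttranscript_id "T2";
1\ttoy\tCDS\t500\t600\t.\t-\t0\ttranscript_id "T2";
"""


def cds_by_transcript(gtf_text):
    """Group CDS lines into one ordered segment list per transcript."""
    segments = defaultdict(list)
    strand = {}
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if f[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', f[8])
        if match is None:
            continue
        tid = match.group(1)
        segments[tid].append((int(f[3]), int(f[4])))
        strand[tid] = f[6]
    # Order segments 5' to 3': ascending on '+', descending on '-'.
    return {
        tid: sorted(segs, reverse=(strand[tid] == "-"))
        for tid, segs in segments.items()
    }


cds_parts = cds_by_transcript(TOY_GTF)
for tid, parts in cds_parts.items():
    print(tid, parts)
```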

0
9 days ago
Juke34 ★ 6.4k

The feature types (3rd column of the GTF) and attributes (9th column) used by Ensembl have changed many times. I have summarized this evolution in a table available here: https://agat.readthedocs.io/en/latest/gxf.html#evolution-of-the-3rd-and-9th-column