GTF format feature definitions
3.1 years ago

I am trying to analyse some GTF annotation files from a Braker2 run, but I do not fully understand the definitions of the features in the feature column. I know what each of them is individually outside of GTF, but I get confused when looking at them within the files.

For example, I see a feature labelled gene that has a length of only 43 bp:

PseudaA_3172    AUGUSTUS    intron  1   7   0.77    -   .   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    CDS 8   43  0.42    -   0   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    gene    1   43  0.42    -   .   g18847
PseudaA_3172    AUGUSTUS    transcript  1   43  0.42    -   .   g18847.t1
PseudaA_3172    AUGUSTUS    exon    8   43  .   -   .   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";
PseudaA_3172    AUGUSTUS    start_codon 41  43  .   -   0   transcript_id "file_1_file_1_g18847.t1"; gene_id "file_1_file_1_g18847";

Where can I find the exact definitions of each feature, and why am I seeing supposed genes that are this short?

My understanding of the GTF format was that it is hierarchical, so every CDS/intron/exon etc. is contained within the span of a parent gene feature, which is what we see here. But this must be wrong?


Your understanding is correct. This GTF fragment you have posted does indeed represent a 43bp long gene. These genes come from AUGUSTUS, which is a gene prediction program, so I'd guess this is probably a false positive.
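To check the lengths yourself, here is a minimal sketch that recomputes them from the six records in the question. It assumes the real file is tab-separated with nine columns (the usual GTF layout); the attribute column is shortened here because it is not needed, and the helper names `parse_gtf_line` and `feature_length` are made up for illustration.

```python
# The six records from the question, rebuilt as tab-separated GTF lines.
# Column 9 (attributes) is shortened because it is not used for lengths.
GTF_TEXT = "\n".join(
    "\t".join(cols)
    for cols in [
        ("PseudaA_3172", "AUGUSTUS", "intron", "1", "7", "0.77", "-", ".", "g18847.t1"),
        ("PseudaA_3172", "AUGUSTUS", "CDS", "8", "43", "0.42", "-", "0", "g18847.t1"),
        ("PseudaA_3172", "AUGUSTUS", "gene", "1", "43", "0.42", "-", ".", "g18847"),
        ("PseudaA_3172", "AUGUSTUS", "transcript", "1", "43", "0.42", "-", ".", "g18847.t1"),
        ("PseudaA_3172", "AUGUSTUS", "exon", "8", "43", ".", "-", ".", "g18847.t1"),
        ("PseudaA_3172", "AUGUSTUS", "start_codon", "41", "43", ".", "-", "0", "g18847.t1"),
    ]
)


def parse_gtf_line(line):
    """Split one GTF record into its nine columns and keep the useful ones."""
    seqname, source, feature, start, end, score, strand, frame, attrs = (
        line.rstrip("\n").split("\t", 8)
    )
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }


def feature_length(rec):
    """GTF coordinates are 1-based and inclusive, so length = end - start + 1."""
    return rec["end"] - rec["start"] + 1


lengths = {}
for line in GTF_TEXT.splitlines():
    rec = parse_gtf_line(line)
    lengths[rec["feature"]] = feature_length(rec)

for feature, bp in lengths.items():
    print(f"{feature}\t{bp} bp")
```

For these records the gene and transcript both span 43 bp, while the CDS is only 36 bp, i.e. 12 codons.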

Unfortunately, there isn't a formal definition of the GTF format (and as such it isn't a "standard"). This goes doubly for the feature column, which people use pretty much however they see fit.


Thank you for this. It would make sense that it is a false positive. It is a shame the format is not standardised!

Do you have any advice on "thresholds" for gene length cut-offs? This will of course be very species dependent but how would I begin to estimate this?


I don't know what the standard is now, but in the old days you would get rid of anything where the CDS was less than 30 amino acids (or 90bp). That might have been for prokaryotes as well.

I would also discard anything that didn't have a complete start and stop codon and at least some 5' and 3' UTR.
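As a rough sketch of that kind of filter, assuming each transcript has already been collected into a list of `(feature, start, end)` tuples: the function names, the `min_cds_bp` parameter, and the `toy` transcript are hypothetical. The UTR criterion is left out, and whether the stop codon counts as part of the CDS differs between GTF flavours, so the 90 bp cutoff is approximate.

```python
def cds_length(features):
    """Total CDS length in bp; GTF coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for kind, start, end in features if kind == "CDS")


def passes_basic_filter(features, min_cds_bp=90):
    """Keep a transcript only if it has both codon features and enough CDS."""
    kinds = {kind for kind, _, _ in features}
    has_both_codons = {"start_codon", "stop_codon"} <= kinds
    return has_both_codons and cds_length(features) >= min_cds_bp


# The transcript from the question: 36 bp of CDS and no stop_codon record.
posted = [
    ("intron", 1, 7),
    ("CDS", 8, 43),
    ("exon", 8, 43),
    ("start_codon", 41, 43),
]

# A made-up transcript with 300 bp of CDS and both codon features.
toy = [
    ("start_codon", 101, 103),
    ("CDS", 101, 400),
    ("stop_codon", 401, 403),
]

print(passes_basic_filter(posted))
print(passes_basic_filter(toy))
```

The posted transcript fails on both counts, while the made-up one passes. The 90 bp default is just the old rule of thumb mentioned above, not a recommended value for any particular species.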

Juke34 8.5k

There are several specifications for the GTF format; see here.

But they mostly cover the syntax. None of them specifies anything like a minimum intron length or gene length. AUGUSTUS can predict partial genes, and that is the case here: you only have the first exon (note the intron running to the very beginning of the sequence, and that this is the minus strand), so the real gene might be longer.
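One way to flag such cases before blaming the gene length is sketched below. It assumes the same `(feature, start, end)` tuples per transcript; the function name `partial_gene_hints` and the hint strings are made up, and the contig length has to be supplied from the assembly if the far end should be checked too.

```python
def partial_gene_hints(features, contig_length=None):
    """Return reasons why a predicted transcript might be a partial gene."""
    hints = []
    kinds = {kind for kind, _, _ in features}
    if "start_codon" not in kinds:
        hints.append("no start_codon")
    if "stop_codon" not in kinds:
        hints.append("no stop_codon")
    for kind, start, end in features:
        if kind != "intron":
            continue
        # An intron touching a contig edge means the next exon is off the contig.
        if start == 1:
            hints.append("intron reaches contig start")
        if contig_length is not None and end == contig_length:
            hints.append("intron reaches contig end")
    return hints


# The transcript from the question (minus strand, intron at positions 1-7).
posted = [
    ("intron", 1, 7),
    ("CDS", 8, 43),
    ("exon", 8, 43),
    ("start_codon", 41, 43),
]

print(partial_gene_hints(posted))
```

For the posted records this reports a missing stop codon and an intron that runs to position 1 of the contig, which matches the partial-gene reading above.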
