Question: Second column in Ensembl GTF: source or biotype?
alex.rubinsteyn (United States) wrote 5.9 years ago:

I'm a little confused by the meaning of the second column in Ensembl's GTF annotation sets. According to the README and online documentation, the second column is supposed to be the source of annotation (e.g. "havana"). However, when I actually look at the release 75 GTF (ftp directory), it looks like this:

#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1    pseudogene    gene    11869    14412    .    +    .    gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1    processed_transcript    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";

Notice how the second column actually seems to contain the transcript_biotype (which is missing from the attributes), and the gene_source *is* in the attributes? Is this a bug in their GTF generation? Is the documentation for some older version of GTF which is no longer supposed to be used? 
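
For anyone who wants to check this on their own file, here is a minimal Python sketch that splits one record into its nine columns and turns the ninth into a dictionary. The function name `parse_gtf_line` and the regular expression are illustrative assumptions, not part of any Ensembl tool, and the sketch assumes real files are tab-separated with attributes written as `key "value";`.

```python
import re


def parse_gtf_line(line):
    """Split one GTF record into its nine columns plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, column2, feature, start, end, score, strand, frame, attr_text = cols
    # Attributes are assumed to look like: key "value"; key "value";
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', attr_text))
    return {
        "seqname": seqname,
        "column2": column2,  # source according to the docs, biotype in this file
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# The transcript record quoted above, rebuilt with explicit tabs.
example = "\t".join([
    "1", "processed_transcript", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; gene_source "ensembl_havana"; '
    'gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; '
    'transcript_source "havana";',
])

record = parse_gtf_line(example)
print(record["column2"])                              # processed_transcript
print(record["attributes"]["transcript_source"])      # havana
print("transcript_biotype" in record["attributes"])   # False
```

On this record the second column carries what looks like a biotype, while the source only appears in the attributes.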


I would guess that somewhere along the line the Ensembl people decided to use the second column to store the "bio_type" rather than the "source". I don't think it is a bug, or anything to do with an old or new GTF format.

Reply written 5.9 years ago by Ashutosh Pandey

Wouldn't changing the meaning of the columns be a change in the GTF format? Otherwise, if the columns can mean arbitrary things, how is it a format at all? 

Reply written 5.9 years ago by alex.rubinsteyn

I would still regard it as a format with a loose structure, in which some columns follow strict definitions and others do not. Every column was created to hold specific information when the GFF format was designed. Some of these columns later became less useful, but because the format was so widely adopted, people thought it would not be a good idea to simply remove them. The first (chromosome), third (feature type), fourth (start), fifth (end) and seventh (strand) columns have strict definitions and should contain the same information regardless of where the GTF file comes from. I guess the second column, the source column, was used by people in the beginning but is not that important now.

Most current programs that read GTF files use the chromosome, start, end and strand to locate features. The third column and the attributes in the ninth column are used to build the hierarchy that relates exons to transcripts and transcripts to genes. I think most tools, such as snpEff (variant annotation) or RNA-seq count and RPKM generators, depend only on the columns that follow strict definitions. By contrast, the sixth column, which used to be a score column, is no longer used and usually contains "."; you can store pretty much any numeric information there.

Reply modified 5.9 years ago • written 5.9 years ago by Ashutosh Pandey
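
The point in the reply above, that the exon/transcript/gene hierarchy comes from the third and ninth columns, can be illustrated with a rough Python sketch. The function `build_hierarchy` and the toy records (IDs `G1` and `T1`, made-up coordinates) are hypothetical and only meant to show that the result does not change when column 2 does; the sketch assumes tab-separated input and `key "value";` attributes.

```python
import re
from collections import defaultdict

ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def build_hierarchy(lines):
    """Return {gene_id: {transcript_id: [(seqname, start, end, strand), ...]}}.

    Only columns 1, 3, 4, 5, 7 and 9 are read; column 2 is ignored.
    """
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        seqname, _col2, feature, start, end, _score, strand, _frame, attr_text = cols
        attrs = dict(ATTR_RE.findall(attr_text))
        if feature == "transcript":
            # Touch the entry so a transcript is registered even without exons.
            genes[attrs["gene_id"]][attrs["transcript_id"]]
        elif feature == "exon":
            genes[attrs["gene_id"]][attrs["transcript_id"]].append(
                (seqname, int(start), int(end), strand)
            )
    # Convert to plain dicts so the result is easy to compare and print.
    return {gene: dict(transcripts) for gene, transcripts in genes.items()}


def toy_records(col2):
    """Made-up records that differ only in the value placed in column 2."""
    attrs = 'gene_id "G1"; transcript_id "T1";'
    return [
        "\t".join(["1", col2, "transcript", "100", "500", ".", "+", ".", attrs]),
        "\t".join(["1", col2, "exon", "100", "200", ".", "+", ".", attrs]),
        "\t".join(["1", col2, "exon", "400", "500", ".", "+", ".", attrs]),
    ]


with_source = build_hierarchy(toy_records("havana"))
with_biotype = build_hierarchy(toy_records("processed_transcript"))
print(with_source == with_biotype)  # True
```

A tool written this way gives the same answer whichever convention the file uses for column 2. A tool that filters on column 2 would not.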
Denise CS (UK, Hinxton, EMBL-EBI) wrote 5.9 years ago:

You are right: there have been inconsistencies between the GTF file and the documentation. The second column was displaying either the source or the biotype, whereas the documentation has always described the second column as the source. From release 77 onwards the inconsistency is no longer there: the second column should always be the source. Having the source in the second column is in accordance with the GENCODE GTF format. Apologies for the confusion.
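
If a script has to cope with files from both sides of the release 77 change, one possible approach is to prefer an attribute and fall back to column 2. The Python sketch below is an assumption-laden illustration: it assumes that files whose second column holds the source carry the biotype in a `transcript_biotype` attribute, which the release 75 excerpt in the question does not have. The helper name and the "newer style" record are hypothetical, so check the layout of the file you actually use.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcript_biotype(line):
    """Best-effort biotype for a transcript-level GTF record.

    Assumption: if a transcript_biotype attribute is present it is
    authoritative; otherwise column 2 is taken to hold the biotype,
    as in the release 75 excerpt. If a file has neither, the value
    returned is simply whatever column 2 contains.
    """
    cols = line.rstrip("\n").split("\t")
    attrs = dict(ATTR_RE.findall(cols[8]))
    if "transcript_biotype" in attrs:
        return attrs["transcript_biotype"]
    return cols[1]


# Older layout, shortened from the record quoted in the question.
old_style = "\t".join([
    "1", "processed_transcript", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'transcript_source "havana";',
])

# Hypothetical newer layout: source in column 2, biotype as an attribute.
new_style = "\t".join([
    "1", "havana", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'transcript_biotype "processed_transcript";',
])

print(transcript_biotype(old_style))  # processed_transcript
print(transcript_biotype(new_style))  # processed_transcript
```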

