I'm a little confused by the meaning of the second column in Ensembl's GTF annotation sets. According to the README and online documentation, the second column is supposed to be the source of annotation (e.g. "havana"). However, when I actually look at the release 75 GTF (ftp directory), it looks like this:
#!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1 processed_transcript transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
Notice how the second column actually seems to contain the transcript_biotype (which is missing from the attributes), while the gene_source *is* in the attributes? Is this a bug in their GTF generation? Or does the documentation describe some older version of GTF which is no longer supposed to be used?
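In case it helps anyone reproduce what I'm seeing, here is a minimal Python sketch that prints the second column next to the matching `*_source` attribute for each record. It is only an illustration: the names `SAMPLE`, `parse_gtf_line` and `summarize` are my own, the two records are copied from the excerpt above, and I'm assuming the nine columns are tab-separated and the attributes follow the `key "value";` pattern.

```python
import re

# Two records copied from the release 75 excerpt; columns assumed tab-separated.
SAMPLE = [
    "#!genome-build GRCh37.p13",
    '1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_source "ensembl_havana"; gene_biotype "pseudogene";',
    '1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; gene_source "ensembl_havana"; '
    'gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; '
    'transcript_source "havana";',
]

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Return (column 2, feature type, attribute dict) for one GTF record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    attrs = dict(ATTR_RE.findall(fields[8]))
    return fields[1], fields[2], attrs


def summarize(lines):
    """List (feature, column 2, <feature>_source attribute) per record."""
    rows = []
    for line in lines:
        if line.startswith("#"):  # skip the #! header lines
            continue
        col2, feature, attrs = parse_gtf_line(line)
        # e.g. gene_source for gene records, transcript_source for transcripts
        rows.append((feature, col2, attrs.get(f"{feature}_source")))
    return rows


if __name__ == "__main__":
    for feature, col2, source in summarize(SAMPLE):
        print(f"{feature}: column 2 = {col2!r}, {feature}_source = {source!r}")
```

For both records, column 2 holds a biotype-like value while the annotation source only shows up in the attributes, which is the mismatch I'm asking about.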