Ensembl GTF format: isn't the tag "transcript_id" mandatory?
6.5 years ago
Pfs ▴ 530

I just downloaded a GTF file from Ensembl and noticed that the "transcript_id" tag is missing from some records in the "attributes" field. I read that "transcript_id" and "gene_id" are mandatory tags (see https://genome.ucsc.edu/FAQ/FAQformat.html#format4). Is the file corrupted, or have the requirements been relaxed?

Thanks in advance!

 

These are the attribute fields of the first two records (note the missing transcript_id in the first one):

 'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "...'

 'gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; ...'
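To see which kinds of records lack the tag, one option is to tally them by feature type (column 3). The following is only a minimal sketch: the function names are made up for illustration, and it assumes tab-separated GTF lines whose attribute values are all double-quoted, as in the Ensembl records above. Unquoted values would be skipped by this regex.

```python
import re
from collections import Counter

# Matches quoted attribute pairs such as: gene_id "ENSG00000223972";
# Assumption: every value is double-quoted and followed by a semicolon.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Return the attribute column (column 9) as a dict of tag -> value."""
    return dict(ATTR_RE.findall(field))


def missing_transcript_id_by_feature(lines):
    """Count, per feature type (column 3), the records lacking transcript_id."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column record
        if "transcript_id" not in parse_attributes(cols[8]):
            counts[cols[2]] += 1
    return counts
```

Passing an open file handle for the downloaded GTF to `missing_transcript_id_by_feature` would show whether the records without the tag are confined to a single feature type.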

RNA-Seq annotation GTF GFF ensembl • 6.3k views
ADD COMMENT
0

Which GTF file are you talking about? And why do you care about the transcript_id field? From the description it looks like you are using the human GTF file. I'd be very interested to know which version. I'd bet it is Ensembl 38, the latest release. My understanding is that the GTF file was modified in the latest release. Here is the problem I encountered: A: RNA-SeQC error no output. Scroll down to the bottom; it isn't my question, but read the comments below.

ADD REPLY
0

I would say that this attribute is mandatory, so its absence is perhaps a bug.

ADD REPLY
0

I don't think it is a bug; I believe it was a conscious decision by Ensembl. I am talking about the Homo sapiens GTF file only. If you download a few different GTF versions from http://www.gencodegenes.org/releases/ and compare them, you will notice that the gene line in the GTF file used to have both a gene_id field and a transcript_id field. Here is the list of GTF files that I have compared:

  • gencode.v19.annotation.gtf
  • gencode.v20.annotation.gtf
  • gencode.v21.annotation.gtf
  • gencode.v22.annotation.gtf
  • gencode.v7.annotation.gtf
  • Homo_sapiens.GRCh37.62.gtf
  • Homo_sapiens.GRCh37.74.gtf
  • Homo_sapiens.GRCh37.75.gtf
  • Homo_sapiens.GRCh38.76.gtf
  • Homo_sapiens.GRCh38.77.gtf
  • Homo_sapiens.GRCh38.78.gtf
  • Homo_sapiens.GRCh38.79.gtf

In all of those GTF files except the two latest ones, Homo_sapiens.GRCh38.79 and gencode.v22.annotation.gtf (which are the same annotation from two different sources), the transcript_id field is present. BUT if you look closely, the transcript_id field in the gene line has the same value as the gene_id! I understand this was a bug of some sort, so the latest GTF is actually an improved version, although I suspect many tools have not adapted to it yet.
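The "transcript_id equals gene_id" observation can be checked mechanically. This is a rough sketch rather than a validated tool: the function name is hypothetical, and it assumes tab-separated records with double-quoted attribute values.

```python
import re

# Matches quoted attribute pairs such as: gene_id "ENSG00000223972";
# Assumption: the two tags of interest are always double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def gene_lines_with_echoed_id(lines):
    """Yield 'gene' records whose transcript_id merely repeats the gene_id."""
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only look at complete gene feature records
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and attrs["transcript_id"] == attrs.get("gene_id"):
            yield line
```

If the function yields every gene line of a file, that file follows the older convention described above; if it yields nothing, the gene lines either carry no transcript_id or do not exist at all.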

I'd be very interested to know more on this topic, because I feel it's important to understand whether or not this is a bug. As I mentioned in my previous comment, I couldn't generate an RNA-SeQC report when I used the Ensembl 38 genome annotation.

 

 

ADD REPLY
0

I've noticed this too with these GTF files and I was also quite surprised by the lack of transcript_id fields. I'm also not entirely sure whether this is intentional or a bug. From googling around when I first noticed this, it seems that the presence of "transcript_id" isn't always specified in descriptions of GTF. I think much of the problem is that there's no real gold standard specification for the format. The closest I've seen is from Ensembl, which basically says, "it's GFF version 2". In fact, even the examples that Ensembl gives lack transcript_ids. This makes sense now that some sources are including "gene" entries, for which a transcript_id has no meaning.

Perhaps we should push to get GTF taken over by the GA4GH file formats team. That'd at least allow a single format definition.

Edit: If others are in favor of the GA4GH route I'd be happy to contact them. Format spec. inconsistencies like this really need to be nipped in the bud.

ADD REPLY
0

AFAIK the GTF 2.0 format is actually defined by having the fields gene_id and transcript_id present. Otherwise it would be a GFF 2.0 file. On the other hand it was clearly ... what is even the right word ... unwise ... to introduce a new "format" called GTF for the sole reason of enforcing these two attributes. 

http://mblab.wustl.edu/GTF22.html

A file that mixes rows of GFF and GTF is still a valid GFF file and as such should be called GFF. Of course, it does not help that there is a GFF 3 format that is similar to GFF 2.0.

 

ADD REPLY
0

I agreed with you until I found Ensembl explicitly defining GTF as GFF 2.0. I'm of the opinion that that was a bad move by Ensembl, but it becomes a question of who gets to define things. I think GTF2.2 as defined by the Brent lab is what most of us conceive of by the format, but even they mention revising the Ensembl GTF (aka GFF 2.0) definition.

ADD REPLY
0

I am referring to Homo_sapiens.GRCh38.79.gtf from Ensembl (yes, the latest release).

 

ADD REPLY
1

I realise that in my previous post I didn't supply the correct link to the GitHub issue. Here it is: https://github.com/broadinstitute/RNA-SeQC/issues/1

ADD REPLY
4
6.5 years ago

Hi guys, 

I think I have an answer for this one. I wrote to Ensembl, and here is their reply:

Until release 74, the gtf files had no gene lines, only transcripts.
Since then, we have added these in, but they do not have a transcript_id
attribute as a gene can have several transcripts and it is not a one-to-one
relationship.

According to the gtf specifications, any non-required field should be ignored,
so we did not anticipate that some software would break because of this
addition.
We are now aware of this issue and are investigating a solution that would
cover both use cases.
In the mean time, I would recommend removing the gene lines from the gtf file
before submitting them to the RNA-SeQC tool.
 

This was obviously regarding my particular need for the `transcript_id` tag. In my case I simply removed all `gene` feature lines from the GTF file and `RNA-SeQC` worked like a charm. I feel that the GENCODE GTF files misled me by having a `transcript_id` tag in the `gene` feature line. In the GENCODE GTF, the value of `transcript_id` in the `gene` feature line is identical to the `gene_id` value, which is misleading in my view. And, as mentioned in the email, a gene can have more than one transcript.

I hope this info will help clear up some confusion. It has certainly helped me.
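For anyone wanting to do the same, here is a minimal sketch of dropping the `gene` feature lines. The function names and file paths are placeholders, and the sketch assumes a standard tab-separated GTF with the feature type in column 3.

```python
def drop_gene_lines(lines):
    """Yield GTF lines, skipping records whose feature type (column 3) is 'gene'."""
    for line in lines:
        if not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) >= 3 and cols[2] == "gene":
                continue  # omit the gene record
        # Header lines and all other feature types pass through unchanged.
        yield line


def write_without_gene_lines(src_path, dst_path):
    """Copy the GTF at src_path to dst_path without its gene records."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_gene_lines(src))
```

A call such as `write_without_gene_lines("Homo_sapiens.GRCh38.79.gtf", "no_gene_lines.gtf")` (the output name is arbitrary) leaves the original file untouched, so the full annotation is still available for tools that do understand gene records.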

 

ADD COMMENT
0

Personally, I decided to distinguish the old Ensembl GTF format (up to release 74) from the new Ensembl GTF format (after release 74) by calling the latter "GTF3".

With the old version it was a bit painful to rebuild the transcripts and the genes from the exon and CDS features. Now they are moving toward a format close to the GFF3 defined by the Sequence Ontology consortium (http://www.sequenceontology.org/resources/gff3.html).

I try to use GFF3 as much as possible, since it has a well-defined specification.

I hope that in the future Ensembl shifts from GTF to GFF3. That format would still allow them to use their own Ensembl-specific attributes (9th column).

ADD REPLY
0
6.1 years ago
justinjj • 0

Hi,

I tried to run RNA-SeQC on my data using the danRer10 GTF file downloaded from Ensembl (latest version). As described here, the format is newer than in release 74, and my RNA-SeQC run failed because of a format compatibility issue. The error is:

java.lang.RuntimeException: No rRNA found in GTF transcript_type field
        at org.broadinstitute.cga.rnaseq.TranscriptList.toRRNAIntervalList(TranscriptList.java:414)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.createRefGeneAndRRNAFiles(RNASeqMetrics.java:1288)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.prepareFiles(RNASeqMetrics.java:191)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:165)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:135)

My GTF file looks like this. It has no "transcript_type" attribute, only "transcript_biotype":

#!genome-build GRCz10
#!genome-version GRCz10
#!genome-date 2014-09
#!genome-build-accession NCBI:GCA_000002035.3
#!genebuild-last-updated 2015-05
4       ensembl exon    52002   52120   .       -       .       gene_id "ENSDARG00000104632"; gene_version "1"; transcript_id "ENSDART00000166186"; transcript_version "1"; exon_number "1"; gene_name "si:ch73-252i11.3"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "si:ch73-252i11.3-201"; transcript_source "ensembl"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001152240"; exon_version "1";

I then edited the file, renaming gene_biotype and transcript_biotype to "gene_type" and "transcript_type", and it works fine. Is this the right and simplest way to adapt the latest GTF format for RNA-SeQC, or will it cause any issues in the results?

I hope someone can clear this up for me :) Thanks. Justin

 

 

ADD COMMENT
0

Please post things like this as a new question next time.

What you did should be fine. I'm surprised RNA-SeQC doesn't allow you to just specify the change with an option.
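For reference, the rename can be scripted so that it only touches the attribute column. This is a sketch under assumptions: the tag mapping is taken from the post above, the function name is made up, and the input is assumed to be a tab-separated GTF with quoted attribute values.

```python
import re

# Mapping described in the post above: Ensembl tag -> tag RNA-SeQC looked for.
RENAMES = {"gene_biotype": "gene_type", "transcript_biotype": "transcript_type"}

# Match the tag name only when it is directly followed by a quoted value.
TAG_RE = re.compile(r'\b(gene_biotype|transcript_biotype)(?= ")')


def rename_biotype_tags(line):
    """Return the GTF line with the biotype tags in column 9 renamed."""
    if line.startswith("#"):
        return line  # header lines are left unchanged
    ending = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return line  # not a complete record; leave as-is
    cols[8] = TAG_RE.sub(lambda m: RENAMES[m.group(1)], cols[8])
    return "\t".join(cols) + ending
```

Applying `rename_biotype_tags` to each line while writing to a new file keeps the original Ensembl GTF intact. Only the tag names change; coordinates and values are left as they are.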

ADD REPLY
