Question

Ensembl GTF format: isn't the tag "transcript_id" mandatory?

4

9.0 years ago

Pfs ▴ 580

I just downloaded a GTF file from Ensembl and noticed that the transcript_id tag is missing from the attributes field of some records. I read that transcript_id and gene_id are mandatory tags (see https://genome.ucsc.edu/FAQ/FAQformat.html#format4). Is the file corrupted, or have the requirements been relaxed?

Thanks in advance!

These are the attribute fields of the first two records (see missing transcript_id in the first one).

'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "...'
'gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; ...'
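A quick way to see which records lack the tag is to parse the attributes column into key/value pairs. This is only a minimal sketch: the function names are made up for illustration, the two strings are shortened copies of the records above, and the regular expression assumes every value is double-quoted (true for these records, not guaranteed for every GTF flavour).

```python
import re

# Matches tag "value" pairs in the GTF attributes column.
# Assumption: every value is wrapped in double quotes.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Return the attributes column as a dict of tag -> value."""
    return dict(ATTR_RE.findall(field))

def missing_tag(attribute_fields, tag="transcript_id"):
    """Return the indices of records whose attributes lack `tag`."""
    return [i for i, field in enumerate(attribute_fields)
            if tag not in parse_attributes(field)]

# Shortened copies of the two records quoted above.
records = [
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana";',
    'gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2";',
]
print(missing_tag(records))  # [0]
```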

annotation RNA-Seq GTF GFF ensembl • 9.7k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by Pfs ▴ 580

0

Which GTF file are you talking about? And why do you care about the transcript_id field? From the description, it looks like you are using the human GTF file. I'd be very interested to know which version; I'd bet Ensembl 38, the latest release. My understanding is that the GTF file was modified in the latest release. Here is the problem I've encountered: A: RNA-SeQC error no output. Just scroll down to the bottom; this isn't my question, but read the comments below it.

ADD REPLY • link updated 8.9 years ago by Emily 23k • written 9.0 years ago by Kirill Tsyganov ▴ 370

0

I would say that the presence of that attribute is mandatory, so its absence is perhaps a bug.

ADD REPLY • link 9.0 years ago by Istvan Albert 100k

0

I don't think it is a bug; I believe it is a conscious decision by Ensembl. I am talking about the Homo sapiens GTF file only. If you download a few different GTF versions from http://www.gencodegenes.org/releases/ and compare them, you will notice that the gene feature line in the GTF file used to have both a gene_id field and a transcript_id field. Here is the list of GTF files that I compared:

gencode.v19.annotation.gtf
gencode.v20.annotation.gtf
gencode.v21.annotation.gtf
gencode.v22.annotation.gtf
gencode.v7.annotation.gtf
Homo_sapiens.GRCh37.62.gtf
Homo_sapiens.GRCh37.74.gtf
Homo_sapiens.GRCh37.75.gtf
Homo_sapiens.GRCh38.76.gtf
Homo_sapiens.GRCh38.77.gtf
Homo_sapiens.GRCh38.78.gtf
Homo_sapiens.GRCh38.79.gtf

In all of those GTF files except the two latest ones, Homo_sapiens.GRCh38.79.gtf and gencode.v22.annotation.gtf (which are the same annotation from two different sources), the transcript_id field is present. BUT if you look closely, the transcript_id field in the gene line has the same value as the gene_id! I understand this was a bug of some sort, so the latest GTF is actually an improved version, although I suspect many tools have not adapted to it yet.
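For anyone who wants to repeat this comparison on their own files, here is a rough sketch that tallies how gene feature lines handle transcript_id. The function name and the three example lines are hypothetical (the coordinates are placeholders, not real annotation), and it assumes tab-separated columns with double-quoted attribute values.

```python
import re

# Assumption: attribute values are double-quoted.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def classify_gene_lines(lines):
    """Tally gene records by how their transcript_id relates to gene_id."""
    counts = {"no_transcript_id": 0, "same_as_gene_id": 0, "different": 0}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only look at gene feature lines
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            counts["no_transcript_id"] += 1
        elif tid == attrs.get("gene_id"):
            counts["same_as_gene_id"] += 1
        else:
            counts["different"] += 1
    return counts

# Illustrative records only: placeholder coordinates, IDs reused from this thread.
def make(feature, attrs):
    return "\t".join(["1", "havana", feature, "1", "100", ".", "+", ".", attrs])

example = [
    make("gene", 'gene_id "ENSG00000223972"; transcript_id "ENSG00000223972";'),
    make("gene", 'gene_id "ENSG00000223972"; gene_version "5";'),
    make("transcript", 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'),
]
print(classify_gene_lines(example))
# {'no_transcript_id': 1, 'same_as_gene_id': 1, 'different': 0}
```

To run it on a real file, pass an open file handle instead of the list, since the function only iterates over lines.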

I'd be very interested to know more on this topic, because I feel it's important to understand whether this is a bug. As I mentioned in my previous comment, I couldn't generate an RNA-SeQC report when I used the Ensembl 38 genome annotation.

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by Kirill Tsyganov ▴ 370

0

I've noticed this too with these GTF files and I was also quite surprised by the lack of transcript_id fields. I'm also not entirely sure whether this is intentional or a bug. From googling around when I first noticed this, it seems that the presence of "transcript_id" isn't always specified in descriptions of GTF. I think much of the problem is that there's no real gold standard specification for the format. The closest I've seen is from Ensembl, which basically says, "it's GFF version 2". In fact, even the examples that Ensembl gives lack transcript_ids. This makes sense now that some sources are including "gene" entries, for which a transcript_id has no meaning.

Perhaps we should push to get GTF taken over by the GA4GH file formats team. That'd at least allow a single format definition.

Edit: If others are in favor of the GA4GH route I'd be happy to contact them. Format spec. inconsistencies like this really need to be nipped in the bud.

ADD REPLY • link 9.0 years ago by Devon Ryan 104k

0

AFAIK the GTF 2.0 format is actually defined by having the fields gene_id and transcript_id present. Otherwise it would be a GFF 2.0 file. On the other hand it was clearly ... what is even the right word ... unwise ... to introduce a new "format" called GTF for the sole reason of enforcing these two attributes.

http://mblab.wustl.edu/GTF22.html

A file that mixes rows of GFF and GTF is still a valid GFF file and as such should be called GFF. Of course, it does not help that there is a GFF 3 format that is similar to GFF 2.0.

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by Istvan Albert 100k

0

I agreed with you until I found Ensembl explicitly defining GTF as GFF 2.0. I'm of the opinion that that was a bad move by Ensembl, but it becomes a question of who gets to define things. I think GTF2.2 as defined by the Brent lab is what most of us conceive of by the format, but even they mention revising the Ensembl GTF (aka GFF 2.0) definition.

ADD REPLY • link 9.0 years ago by Devon Ryan 104k

0

I am referring to Homo_sapiens.GRCh38.79.gtf from Ensembl (yes, the latest release).

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by Pfs ▴ 580

1

I realise that in my previous post I didn't supply the correct link to the GitHub issue. Here it is: https://github.com/broadinstitute/RNA-SeQC/issues/1

ADD REPLY • link updated 21 months ago by Ram 43k • written 9.0 years ago by Kirill Tsyganov ▴ 370

Ram · Answer 1 · 2015-05-12

Hi guys,

I think I have an answer for this one. I wrote to Ensembl and here is the reply from them.

Until release 74, the gtf files had no gene lines, only transcripts. Since then, we have added these in, but they do not have a transcript_id attribute as a gene can have several transcripts and it is not a one-to-one relationship.

According to the gtf specifications, any non-required field should be ignored, so we did not anticipate that some software would break because of this addition.

We are now aware of this issue and are investigating a solution that would cover both use cases. In the mean time, I would recommend removing the gene lines from the gtf file before submitting them to the RNA-SeQC tool.

This was obviously regarding my particular need for the transcript_id tag. In my case I simply removed all gene feature lines from the GTF file and RNA-SeQC worked like a charm. I feel that the GENCODE GTF files misled me by having a transcript_id tag in the gene feature line. In the GENCODE GTF, the value of transcript_id in the gene feature line is identical to the gene_id value, which is misleading in my view. And, as mentioned in the email, a gene can have more than one transcript.
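For reference, removing the gene feature lines can be sketched roughly like this. It is not the exact command I ran; the function name and the file names in the comment are placeholders, and it assumes a tab-separated file with the feature type in the third column.

```python
def drop_gene_lines(lines):
    """Yield every line except records whose feature column is 'gene'."""
    for line in lines:
        if not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) > 2 and cols[2] == "gene":
                continue  # skip gene feature records
        yield line  # keep headers and all other features

# Small in-memory demonstration with placeholder records.
demo = [
    "#!genome-build GRCh38\n",
    "1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id \"ENSG00000223972\";\n",
    "1\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id \"ENSG00000223972\"; transcript_id \"ENST00000456328\";\n",
]
kept = list(drop_gene_lines(demo))
print(len(kept))  # 2

# On real files (hypothetical names):
# with open("input.gtf") as src, open("no_gene_lines.gtf", "w") as dst:
#     dst.writelines(drop_gene_lines(src))
```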

I hope this info will help clear up some confusion. It has certainly helped me.

Ram · Answer 2 · 2015-09-21

Hi,

I tried to run RNA-SeQC on my data using the danRer10 GTF file downloaded from Ensembl (latest version). As described here, the format is newer than that of release 74, and RNA-SeQC failed because of a format compatibility issue. The error is:

java.lang.RuntimeException: No rRNA found in GTF transcript_type field
        at org.broadinstitute.cga.rnaseq.TranscriptList.toRRNAIntervalList(TranscriptList.java:414)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.createRefGeneAndRRNAFiles(RNASeqMetrics.java:1288)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.prepareFiles(RNASeqMetrics.java:191)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:165)
        at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:135)

My GTF file looks like this. It has no "transcript_type" attribute; it has "transcript_biotype" instead:

#!genome-build GRCz10
#!genome-version GRCz10
#!genome-date 2014-09
#!genome-build-accession NCBI:GCA_000002035.3
#!genebuild-last-updated 2015-05
4       ensembl exon    52002   52120   .       -       .       gene_id "ENSDARG00000104632"; gene_version "1"; transcript_id "ENSDART00000166186"; transcript_version "1"; exon_number "1"; gene_name "si:ch73-252i11.3"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "si:ch73-252i11.3-201"; transcript_source "ensembl"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001152240"; exon_version "1";

I then edited the file, replacing gene_biotype and transcript_biotype with gene_type and transcript_type, and it works fine. Is this the right and simplest way to adapt the latest GTF format for RNA-SeQC, or will it cause any issues in the results?
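The rename I mean is roughly the following sketch. The function name is made up, the record is a shortened copy of the exon line above rebuilt with tabs, and only the attributes column is touched. It changes the tag names only, not the values, so it says nothing about whether values such as "lincRNA" match what RNA-SeQC expects.

```python
import re

# Tag names only: gene_biotype -> gene_type, transcript_biotype -> transcript_type.
BIOTYPE_RE = re.compile(r'\b(gene|transcript)_biotype(?= ")')

def rename_biotype_tags(line):
    """Rename the *_biotype tags in the attributes column of one GTF line."""
    if line.startswith("#"):
        return line  # leave header lines alone
    cols = line.split("\t")
    if len(cols) < 9:
        return line  # not a standard 9-column record
    cols[8] = BIOTYPE_RE.sub(r"\1_type", cols[8])
    return "\t".join(cols)

# Shortened copy of the exon record shown above.
record = "\t".join([
    "4", "ensembl", "exon", "52002", "52120", ".", "-", ".",
    'gene_id "ENSDARG00000104632"; transcript_id "ENSDART00000166186"; '
    'gene_biotype "lincRNA"; transcript_biotype "lincRNA";',
])
print(rename_biotype_tags(record).split("\t")[8])
```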

Hope someone can clear this up for me :)

Thanks.

Justin