I just downloaded a GTF file from Ensembl and noticed that the transcript_id tag is missing from some records in the attributes field. I read that transcript_id and gene_id are mandatory tags (see https://genome.ucsc.edu/FAQ/FAQformat.html#format4). Is the file corrupted, or have the requirements been relaxed?
Thanks in advance!
These are the attribute fields of the first two records (see missing transcript_id in the first one).
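To see which kinds of records are affected, a tally by feature type can help. The following is only a minimal sketch: it assumes a tab-separated, 9-column GTF with attributes written as key "value" pairs, and the function name is made up for illustration.

```python
from collections import Counter


def missing_transcript_id_by_feature(lines):
    """Count, per feature type (column 3), records whose attributes lack transcript_id."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        feature, attributes = fields[2], fields[8]
        # Assumption: the key is written as transcript_id "...".
        if 'transcript_id "' not in attributes:
            counts[feature] += 1
    return counts
```

Passing an open file handle (for example, the result of open() on the downloaded GTF) as lines should return something like a Counter keyed by feature type. If only the "gene" feature shows up, the file is probably not corrupted.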
I think I have an answer for this one. I wrote to Ensembl and here is the reply from them.
Until release 74, the gtf files had no gene lines, only transcripts. Since then, we have added these in, but they do not have a transcript_id attribute as a gene can have several transcripts and it is not a one-to-one relationship.
According to the gtf specifications, any non-required field should be ignored, so we did not anticipate that some software would break because of this addition.
We are now aware of this issue and are investigating a solution that would cover both use cases. In the meantime, I would recommend removing the gene lines from the gtf file before submitting them to the RNA-SeQC tool.
This was obviously in response to my particular need for the transcript_id tag. In my case I simply removed all gene feature lines from the GTF file and RNA-SeQC worked like a charm. I feel that the GENCODE GTF files misled me by having a transcript_id tag in the gene feature line. In GENCODE GTF files the value of transcript_id in the gene feature line is identical to the gene_id value, which is misleading in my view. And, as mentioned in the email, a gene can have more than one transcript.
I hope this info helps clear up some confusion. It has certainly helped me.
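For anyone who wants to do the same, removing the gene feature lines can be sketched roughly as below. This assumes a tab-separated GTF where column 3 holds the feature type; the function name is hypothetical and this is not part of RNA-SeQC or any Ensembl tool.

```python
def drop_gene_lines(lines):
    """Yield every GTF line except records whose feature type (column 3) is 'gene'."""
    for line in lines:
        if not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) > 2 and fields[2] == "gene":
                continue  # skip the gene record
        # Header/comment lines and all other records pass through unchanged.
        yield line
```

Writing the result back out is then a matter of opening the input and output files and calling writelines() on the output with drop_gene_lines(input_handle).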
I tried to run RNA-SeQC on my data using the danRer10 GTF file downloaded from Ensembl (latest version), as described here. The format is newer than that of release 74, and RNA-SeQC failed because of a format compatibility issue. The error is:
java.lang.RuntimeException: No rRNA found in GTF transcript_type field
My GTF file looks like this. It has no "transcript_type" attribute at all; it has "transcript_biotype" instead.
Then I edited the file, renaming gene_biotype to gene_type and transcript_biotype to transcript_type, and it worked fine. Is this the right and simplest way to adapt the latest GTF format for RNA-SeQC, or will it cause any issues in the results?
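The renaming described above could be scripted along these lines. This is a sketch under assumptions: attributes are written as key "value" pairs in column 9, only the two keys mentioned here are renamed, and the function name is invented for illustration. Attribute values are left untouched, so it says nothing about whether the downstream results are affected.

```python
import re

# Old key -> new key, as described in the post above.
RENAMES = {"gene_biotype": "gene_type", "transcript_biotype": "transcript_type"}

# Match either key only when it is a whole word followed by ' "' (i.e. used as a key).
_KEY = re.compile(r'\b(gene_biotype|transcript_biotype)(?= ")')


def rename_biotype_keys(line):
    """Rename the *_biotype attribute keys in one GTF line; other lines pass through."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line  # not a complete GTF record
    # Only the attributes column is edited; columns 1-8 are left as they are.
    fields[8] = _KEY.sub(lambda m: RENAMES[m.group(1)], fields[8])
    return "\t".join(fields) + newline
```

Restricting the substitution to the attributes column avoids accidentally touching the source or feature columns.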
Which GTF file are you talking about? And why do you care about the transcript_id field? From the description it looks like you are using a human GTF file. I'd be very interested to know which version. I'd bet it is Ensembl 38, the latest release. My understanding is that the GTF file was modified in the latest release. Here is the problem I've encountered: A: RNA-SeQC error no output. Just scroll down to the bottom; this isn't my question, but read the comments below.
I would say that the presence of that attribute is mandatory and that its absence is perhaps a bug.
I don't think it is a bug; I believe it is a conscious decision by Ensembl. I am talking about the Homo sapiens GTF file only. If you download a few different GTF versions from http://www.gencodegenes.org/releases/ and compare them, you will notice that the gene line in the GTF file used to have both a gene_id field and a transcript_id field. Here is the list of GTF files that I have compared:
In all of those GTF files except the two latest ones, including gencode.v22.annotation.gtf (which are the same annotation from two different sources), the transcript_id field is present. BUT if you look closely, the transcript_id field in the gene line has the same value as the gene_id! As I understand it, this was a bug of some sort. So the latest GTF is actually an improved version, although I suspect many tools have not adapted to it yet.
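The comparison described above can be repeated on any release with a small check like the one below. It is a rough sketch: the attribute parsing assumes key "value" pairs, and the function names and return labels are made up for illustration.

```python
import re

# Matches key "value" pairs; unquoted values, if any, are skipped.
_ATTR = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(text):
    """Parse a GTF attributes column into a dict (last value wins for repeated keys)."""
    return dict(_ATTR.findall(text))


def gene_line_status(line):
    """Classify the transcript_id on a 'gene' record.

    Returns None for comments and non-gene records, otherwise one of
    'absent', 'same_as_gene_id', or 'different'.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    attrs = parse_attributes(fields[8])
    if "transcript_id" not in attrs:
        return "absent"
    if attrs["transcript_id"] == attrs.get("gene_id"):
        return "same_as_gene_id"
    return "different"
```

Counting the returned labels over a whole file (for instance with collections.Counter) should show at a glance whether a given release writes the gene_id into transcript_id on gene lines or omits the tag altogether.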
I'd be very interested to know more on this topic, because I feel it's important to understand whether this is or isn't a bug. As I mentioned in my previous comment, I couldn't generate an RNA-SeQC report when I used the Ensembl 38 genome annotation.
I've noticed this too with these GTF files and I was also quite surprised by the lack of transcript_id fields. I'm also not entirely sure whether this is intentional or a bug. From googling around when I first noticed this, it seems that the presence of "transcript_id" isn't always specified in descriptions of GTF. I think much of the problem is that there's no real gold standard specification for the format. The closest I've seen is from Ensembl, which basically says, "it's GFF version 2". In fact, even the examples that Ensembl gives lack transcript_ids. This makes sense now that some sources are including "gene" entries, for which a transcript_id has no meaning.
Perhaps we should push to get GTF taken over by the GA4GH file formats team. That'd at least allow a single format definition.
Edit: If others are in favor of the GA4GH route I'd be happy to contact them. Format spec. inconsistencies like this really need to be nipped in the bud.
AFAIK the GTF 2.0 format is actually defined by having the fields gene_id and transcript_id present. Otherwise it would be a GFF 2.0 file. On the other hand it was clearly ... what is even the right word ... unwise ... to introduce a new "format" called GTF for the sole reason of enforcing these two attributes.
A file that mixes rows of GFF and GTF is still a valid GFF file and as such should be called GFF. Of course it does not help that there is a GFF 3 format that is similar to GFF 2.0.
I agreed with you until I found Ensembl explicitly defining GTF as GFF 2.0. I'm of the opinion that that was a bad move by Ensembl, but it becomes a question of who gets to define things. I think GTF2.2 as defined by the Brent lab is what most of us think of as the format, but even they mention revising the Ensembl GTF (aka GFF 2.0) definition.
I am referring to Homo_sapiens.GRCh38.79.gtf from Ensembl (yes, the latest release).
I realise that in my previous post I didn't supply the correct link to the GitHub issue. Here it is: https://github.com/broadinstitute/RNA-SeQC/issues/1