I just downloaded a GTF file from Ensembl and noticed that the "transcript_id" tag is missing from some records in the "attributes" field. I read that "transcript_id" and "gene_id" are mandatory tags (see https://genome.ucsc.edu/FAQ/FAQformat.html#format4). Is the file corrupted, or have the requirements been relaxed?
Thanks in advance!
These are the attribute fields of the first two records (see missing transcript_id in the first one).
I think I have an answer for this one. I wrote to Ensembl, and here is their reply:
Until release 74, the gtf files had no gene lines, only transcripts.
Since then, we have added these in, but they do not have a transcript_id
attribute as a gene can have several transcripts and it is not a one-to-one
According to the gtf specifications, any non-required field should be ignored,
so we did not anticipate that some software would break because of this
We are now aware of this issue and are investigating a solution that would
cover both use cases.
In the mean time, I would recommend removing the gene lines from the gtf file
before submitting them to the RNA-SeQC tool.
This was obviously regarding my particular need for the `transcript_id` tag. In my case I simply removed all `gene` feature lines from the GTF file and `RNA-SeQC` worked like a charm. I feel that the GENCODE GTF files misled me by having a `transcript_id` tag in the `gene` feature line. In a GENCODE GTF, the value of `transcript_id` in the `gene` feature line is identical to the `gene_id` value, which is misleading in my view. And, as mentioned in the email, a gene can have more than one transcript.
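For anyone who wants to do the same, here is a minimal sketch of how the `gene` lines could be dropped. It assumes a standard tab-separated GTF where the feature type is the third column; the function name `drop_gene_lines` and the file names in the usage note are only illustrative, not anything Ensembl or RNA-SeQC provides.

```python
def drop_gene_lines(lines):
    """Yield GTF lines, skipping records whose feature column is 'gene'.

    Assumption: tab-separated GTF with the feature type in column 3.
    Header/comment lines (starting with '#') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "gene":
            continue  # drop the gene-level record
        yield line


def filter_gtf(in_path, out_path):
    """Copy in_path to out_path without the 'gene' feature lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_gene_lines(src))
```

Usage would be something like `filter_gtf("input.gtf", "input.no_genes.gtf")` (hypothetical file names). Only the `gene` records are removed; transcript, exon and other lines are written out untouched.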
I hope this info will help clear up some confusion. It has certainly helped me.
I tried to run RNA-SeQC on my data using the danRer10 GTF file downloaded from Ensembl (latest version). As described here, the format is newer than release 74, and RNA-SeQC failed because of a format compatibility issue. The error is:
java.lang.RuntimeException: No rRNA found in GTF transcript_type field
My GTF file looks like this. It has no "transcript_type" attribute; it has "transcript_biotype" instead.
Then I edited the file, replacing "gene_biotype" and "transcript_biotype" with "gene_type" and "transcript_type", and it works fine. Is this the right and simplest way to convert the latest GTF format for RNA-SeQC, or will this cause any issues in the results?
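In case it helps to see exactly what I mean by the edit, this is a rough sketch of the renaming, assuming a tab-separated GTF with the attributes in column 9 written as `key "value";` pairs. The function name `rename_biotype_keys` is just something made up for illustration. It only touches the attribute keys, not their values or any other column.

```python
import re

# Attribute keys to rename (old Ensembl-style key -> key named in the error).
RENAMES = {
    "gene_biotype": "gene_type",
    "transcript_biotype": "transcript_type",
}

# Match a key only at the start of the attribute string or right after ';',
# and only when it is followed by whitespace and an opening quote.
_KEY = re.compile(r'(^|;\s*)(gene_biotype|transcript_biotype)(?=\s+")')


def rename_biotype_keys(line):
    """Return the GTF line with the biotype attribute keys renamed."""
    if line.startswith("#"):
        return line  # header/comment lines are left as they are
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line  # not a complete GTF record; leave untouched
    fields[8] = _KEY.sub(lambda m: m.group(1) + RENAMES[m.group(2)], fields[8])
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applying this to every line of the file and writing the result to a new GTF gives the same outcome as my manual edit, with the values (for example `"rRNA"`) left unchanged.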