I have a gtf file that looks as follows (lines for transcript ENSMUST00000027477):
1 protein_coding exon 87510093 87510362 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000342435";
1 protein_coding CDS 87510093 87510199 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding start_codon 87510197 87510199 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";
1 protein_coding exon 87509245 87509387 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000358865";
1 protein_coding CDS 87509245 87509387 . - 1 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87503266 87503573 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325521";
1 protein_coding CDS 87503266 87503573 . - 2 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87489584 87489744 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325514";
1 protein_coding CDS 87489584 87489744 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87487799 87487951 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157375";
1 protein_coding CDS 87487799 87487951 . - 1 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87486156 87486285 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157378";
1 protein_coding CDS 87486156 87486285 . - 1 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87484599 87484673 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001254502";
1 protein_coding CDS 87484599 87484673 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87482623 87482712 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001289300";
1 protein_coding CDS 87482623 87482712 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87481116 87481279 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001251518";
1 protein_coding CDS 87481116 87481279 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87480587 87480742 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001242957";
1 protein_coding CDS 87480587 87480742 . - 1 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87479837 87479916 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325453";
1 protein_coding CDS 87479837 87479916 . - 1 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87479103 87479207 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325447";
1 protein_coding CDS 87479103 87479207 . - 2 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding exon 87476834 87477744 . - . gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000511304";
1 protein_coding CDS 87477557 87477744 . - 2 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1 protein_coding stop_codon 87477554 87477556 . - 0 gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";
I am not entirely certain, but I think it contains all the information to determine the coordinates of the transcript ENSMUST00000027477 (from the start of exon 1 to the end of the last exon). However, there is no "transcript" line with such information.
I am trying to make a Kallisto index with this gtf file, and it is giving me an error precisely because it is missing the transcript lines:
Exception: The following transcripts have a "exon" feature but no corresponding "transcript" feature(s):
Does anybody know of a way to add such a transcript feature? I can imagine adding it "manually" for each transcript using bash commands, but I wonder if there is a tool available that does this.
Thank you!!
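For reference, the reasoning in the question (a transcript runs from the lowest exon start to the highest exon end) can be sketched in Python. This is only a minimal illustration, not how AGAT or any other tool does it. The function names transcript_spans and transcript_line are made up for this sketch, and it assumes tab-separated GTF columns and attribute values that contain no semicolons.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_spans(lines):
    """Collect, per transcript_id, the min exon start and max exon end."""
    spans = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # skip comments, CDS, start_codon, stop_codon, ...
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        start, end = int(fields[3]), int(fields[4])
        if tid in spans:
            spans[tid]["start"] = min(spans[tid]["start"], start)
            spans[tid]["end"] = max(spans[tid]["end"], end)
        else:
            # Keep the shared attributes, drop the exon-specific ones.
            attrs = [a.strip() for a in fields[8].split(";") if a.strip()]
            attrs = [a for a in attrs if not a.startswith("exon_")]
            spans[tid] = {
                "seq": fields[0], "source": fields[1],
                "start": start, "end": end, "strand": fields[6],
                "attrs": "; ".join(attrs) + ";",
            }
    return spans


def transcript_line(span):
    """Format one span as a GTF 'transcript' feature line."""
    return "\t".join([
        span["seq"], span["source"], "transcript",
        str(span["start"]), str(span["end"]),
        ".", span["strand"], ".", span["attrs"],
    ])


# First and last exon of ENSMUST00000027477, taken from the lines above.
attrs = ('gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; '
         'exon_number "{n}"; gene_name "Ngef";')
example = [
    "\t".join(["1", "protein_coding", "exon", "87510093", "87510362",
               ".", "-", ".", attrs.format(n=1)]),
    "\t".join(["1", "protein_coding", "exon", "87476834", "87477744",
               ".", "-", ".", attrs.format(n=13)]),
]
for span in transcript_spans(example).values():
    print(transcript_line(span))
```

With these two exons the transcript feature would span 87476834 to 87510362 on the minus strand. A dedicated tool is still the safer route for a whole annotation file, since it also handles gene-level features and edge cases.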
You could try AGAT, which has a script, agat_convert_sp_gxf2gxf.pl, that tries to add missing features to a GTF or GFF3 file. I think example 9 on the main page is similar to what you want to do.
Thank you! Yes, that is exactly what I needed.
You could try running it through one of the scripts from AGAT, which might fill in the "missing lines". I think agat_convert_sp_gxf2gxf.pl is a good option. More info is also in this post: AGAT - Another Gff Analysis Toolkit
Thank you so much!! If I may, I am going to paste the question I made to rpolicastro on the same topic:
Thank you! Yes, that is exactly what I needed. However the function does not seem to be working for me. Here is the output that I get after executing it:
And then it stops there.... no more output. The gtf looks fine, do you have any experience with this program? Thank you!!
Did it crash? How long did it run? I recently added some extra checks specific to this blocking step; maybe what I added is not efficient computationally, and you ended up in that case. Which version of AGAT are you using? Could you share your file?
Hi Juke34. It seemed like it wasn't working but it ended up running! It was just that it took very long (~5 hours). Here is the output I got from that section:
I think this step was precisely solving the issue I had: the lack of transcript feature for each group of exons corresponding to a transcript.
Thanks so much for your reply and thank you so much for making this software, it is fantastic!
Edit: the gtf looks fine to me, but it seems like Kallisto cannot use it to make a Kallisto index. Here is a portion of the errors I get:
I assume the conversion your software makes is not compatible with kb ref?
Edit2: I found this link that mentions the gtf format kb ref needs:
https://github.com/pachterlab/kb_python/issues/48
I quote: "Specifically, kb looks for gene_id gene_name transcript_id GTF attributes, and these attributes must be semicolon-separated list of tag-value pairs separated by a single space."
I observed that the pairs are separated by "=", so I ran the following command:
Here is an example of the new lines:
PS: I know I need to replace "mRNA" with "transcript" according to the original error message of this post, but this doesn't seem to be the origin of this issue...
The default output from agat_convert_sp_gxf2gxf.pl is a GFF file. Please use agat_convert_sp_gff2gtf.pl on the newly created file to make a proper GTF file.
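To illustrate why kb rejects the GFF output, here is a small Python sketch of the difference between the two attribute syntaxes: GFF3 uses key=value pairs, while GTF uses key "value"; pairs separated by a single space. The function name gff3_to_gtf_attributes is made up for this example, and it is not what agat_convert_sp_gff2gtf.pl does internally. It only rewrites the syntax of the ninth column.

```python
def gff3_to_gtf_attributes(column9):
    """Rewrite 'key=value;key=value' as 'key "value"; key "value";'."""
    pairs = []
    for item in column9.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        pairs.append(f'{key} "{value}";')
    return " ".join(pairs)


gff3_attrs = "gene_id=ENSMUSG00000026259;transcript_id=ENSMUST00000027477;gene_name=Ngef"
print(gff3_to_gtf_attributes(gff3_attrs))
```

A real conversion also has to translate the GFF3 ID/Parent relationships into gene_id and transcript_id attributes on every line, which is why using the dedicated converter is preferable to a text substitution.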
Can this be used to add an entrez ID attribute to an existing gtf file that doesn't have entrez IDs?
Are you talking about AGAT? AGAT is not designed to retrieve entrez IDs. If you have them in a tsv file together with the corresponding feature IDs (the same ones used in your GTF/GFF file), then yes, AGAT can attach the information.
Yes, I have made a tsv file with the entrez ID followed by the Ensembl ID for each respective entrez ID. In the gtf file it looks something like this
and my reference file, which holds the feature (meaning the entrezID) that I want to add, looks like this
Will this format work with AGAT?
Is this the resource
https://agat.readthedocs.io/en/latest/tools/agat_sq_add_attributes_from_tsv.html
agat_sq_add_attributes_from_tsv.pl can help, but you will need the gene_id in the ID column, as explained here: https://github.com/NBISweden/AGAT/blob/f564798d63a3a0f326fe81f22533b92cc31ecddf/bin/agat_sq_add_attributes_from_tsv.pl#L149C1-L162
I am a bit confused about "the gene_id in the ID column". Do I have to make my reference file as shown in the link above?
You must add, as the first column of your tsv file, the unique identifier used to identify the features in your gtf, i.e. the values attached to the gene_id attributes.
Going by the example in the agat_sq_add code, the input tsv is as such
Now, how should I proceed in my case? I want to put the entrezID into the gtf file, and each one is a unique identifier; at the same time it should match the respective "ENSCAFG" ID for each entrez ID in the gtf file. So I should move the 2nd column of my reference file to the first column and name it gene_id to match the gtf. That should work, I suppose.
Right, but you should also remove the extra VGNC:VGNC:43937|Ensembl: info and keep only the identifier, e.g. ENSCAFG00845006432.
Thank you for the clarification.
I created the file as you suggested. Now I see there are rows that have no entrez ID mapping, so basically those lines are empty. Should I also filter out those empty rows to get the right input?
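The tsv preparation discussed above (identifier first, prefix stripped, unmapped rows dropped) could look roughly like this in Python. It is a sketch under several assumptions: the reference file is taken to have the entrez ID in the first column and a cross-reference string such as VGNC:VGNC:43937|Ensembl:ENSCAFG00845006432 in the second; the entrez values shown are placeholders, not real IDs; the header names gene_id and entrezID follow the discussion above and should be checked against the AGAT documentation. The helper name clean_rows is made up.

```python
import csv
import io
import re

# Pull the Ensembl gene identifier out of a string like
# "VGNC:VGNC:43937|Ensembl:ENSCAFG00845006432".
ENSEMBL_GENE = re.compile(r"Ensembl:(ENS[A-Z]*G\d+)")


def clean_rows(rows):
    """Yield (gene_id, entrez_id) pairs, skipping rows with a missing value."""
    for entrez, xref in rows:
        match = ENSEMBL_GENE.search(xref)
        if not entrez.strip() or match is None:
            continue  # no entrez ID or no Ensembl gene ID: nothing to attach
        yield match.group(1), entrez.strip()


# Placeholder rows: (entrez ID, cross-reference string).
example_rows = [
    ("12345", "VGNC:VGNC:43937|Ensembl:ENSCAFG00845006432"),
    ("", "Ensembl:ENSCAFG00845006432"),   # missing entrez ID, dropped
    ("67890", "-"),                        # missing Ensembl ID, dropped
]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["gene_id", "entrezID"])   # assumed header names
writer.writerows(clean_rows(example_rows))
print(out.getvalue(), end="")
```

Dropping the rows without a mapping keeps every line of the tsv complete, so each remaining gene_id has exactly one value to attach.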