How to add "transcript" feature to a gtf file?
0
1
Entering edit mode
20 months ago

I have a gtf file that looks as follows (lines for transcript ENSMUST00000027477):

protein_coding  exon    87510093    87510362    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000342435";
1   protein_coding  CDS 87510093    87510199    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  start_codon 87510197    87510199    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";
1   protein_coding  exon    87509245    87509387    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000358865";
1   protein_coding  CDS 87509245    87509387    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87503266    87503573    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325521";
1   protein_coding  CDS 87503266    87503573    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87489584    87489744    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325514";
1   protein_coding  CDS 87489584    87489744    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87487799    87487951    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157375";
1   protein_coding  CDS 87487799    87487951    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87486156    87486285    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157378";
1   protein_coding  CDS 87486156    87486285    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87484599    87484673    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001254502";
1   protein_coding  CDS 87484599    87484673    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87482623    87482712    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001289300";
1   protein_coding  CDS 87482623    87482712    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87481116    87481279    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001251518";
1   protein_coding  CDS 87481116    87481279    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87480587    87480742    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001242957";
1   protein_coding  CDS 87480587    87480742    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87479837    87479916    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325453";
1   protein_coding  CDS 87479837    87479916    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87479103    87479207    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325447";
1   protein_coding  CDS 87479103    87479207    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87476834    87477744    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000511304";
1   protein_coding  CDS 87477557    87477744    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  stop_codon  87477554    87477556    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";

I am not entirely certain, but I think it contains all the information to determine the coordinates of the transcript ENSMUST00000027477 (from the start of exon 1 to the end of the last exon). However, there is no "transcript" line with such information.

I am trying to make a Kallisto index with this gtf file, and it is giving me an error precisely because it is missing the transcript lines:

Exception: The following transcripts have a "exon" feature but no corresponding "transcript" feature(s):

Does anybody know of a way of including such transcript feature? I can imagine adding it "manually" for each transcript using bash commands but I wonder if there is a tool available that performs this.

Thank you!!

gtf • 893 views
ADD COMMENT
3
Entering edit mode

You could try AGAT, which has a function agat_convert_sp_gxf2gxf.pl to try to add missing features to a GTF or GFF3 file. I think example 9 on the main page is similar to what you want to do.

ADD REPLY
0
Entering edit mode

Thank you! Yes, that is exactly what I needed.

ADD REPLY
3
Entering edit mode

You could try to run it through one of the script/tools in from AGAT, that might fill in 'missing lines' . I think that agat_convert_sp_gxf2gxf.pl is a good option.

More info also in this post :AGAT - Another Gff Analysis Toolkit

ADD REPLY
0
Entering edit mode

Thank you so much!! If I may, I am going to paste the question I made to rpolicastro on the same topic:

Thank you! Yes, that is exactly what I needed. However the function does not seem to be working for me. Here is the output that I get after executing it:

    Filtering ontology:
        We found 1757 terms that are sequence_feature or is_a child of it.
-------------------------------- parse features --------------------------------
=> GFF version parser used: 2
********************************************************************************
*                               - End parsing -                                *
*                             done in 467 seconds                              *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------

And then it stops there.... no more output. The gtf looks fine, do you have any experience with this program? Thank you!!

ADD REPLY
0
Entering edit mode

Did it crash? How long dit it run? I recently added some extra check very specific at this blocking step, maybe I added something not efficient at al in term of calcul, and you end up in that case. What version of AGAT do you use ? Could you share your file ?

ADD REPLY
0
Entering edit mode

Hi Juke34. It seemed like it wasn't working but it ended up running! It was just that it took very long (~5 hours). Here is the output I got from that section:

--------------------------- Check4: l2 linked to l3 ----------------------------
88226 cases fixed where L3 features have parent feature(s) missing
---------------------------- done in 17232 seconds -----------------------------

I think this step was precisely solving the issue I had: the lack of transcript feature for each group of exons corresponding to a transcript.

Thanks so much for your reply and thank you so much for making this software, it is fantastic!

Edit: the gtf looks fine to me, but it seems like Kallisto cannot use it to make a Kallisto index. Here is a portion of the errors I get:

2021-02-05 11:01:58,563] WARNING Failed to parse GTF attributes of entry: X protein_coding  CDS 57299460    57299661    .   -   0   ID=cds-72867;Parent=ENSMUST00000033468;exon_number=5;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,564] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57302310    57302431    .   -   2   ID=cds-72866;Parent=ENSMUST00000033468;exon_number=4;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,565] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57304566    57304650    .   -   0   ID=cds-72865;Parent=ENSMUST00000033468;exon_number=3;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,565] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57336952    57337035    .   -   0   ID=cds-72864;Parent=ENSMUST00000033468;exon_number=2;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,566] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57338338    57338574    .   -   0   ID=cds-72863;Parent=ENSMUST00000033468;exon_number=1;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

I assume the conversion your software makes is not compatible with kb ref?

Edit2: I found this link that mentions the gtf format kb ref needs:

https://github.com/pachterlab/kb_python/issues/48

I quote: "Specifically, kb looks for gene_id gene_name transcript_id GTF attributes, and these attributes must be semicolon-separated list of tag-value pairs separated by a single space."

I observed that the pairs are separated by "=", so I ran the following command:

sed 's/=/ /g' mm10.converted.gtf > mm10.converted_equal.gtf

Here is an example of the new lines:

MG4209_PATCH    protein_coding  start_codon 45949669    45949671    .   +   0   ID start_codon-40939;Parent ENSMUST00000180606;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097528;gene_name AC233742.1;transcript_id ENSMUST00000180606;transcript_name AC233742.1-201
MG4209_PATCH    protein_coding  gene    46156520    46184432    .   -   .   ID nbis-gene-22893;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;protein_id ENSMUSP00000137714;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  mRNA    46156520    46184432    .   -   .   ID ENSMUST00000180451;Parent nbis-gene-22893;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;protein_id ENSMUSP00000137714;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  exon    46156520    46156586    .   -   .   ID ENSMUSE00001120372;Parent ENSMUST00000180451;exon_id ENSMUSE00001120372;exon_number 7;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  exon    46157336    46157453    .   -   .   ID ENSMUSE00001098262;Parent ENSMUST00000180451;exon_id ENSMUSE00001098262;exon_number 6;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201

PS: I know I need to change "mRNA" by "transcript" according to the original error message of this post, but this doesn't seem to be the origin of this issue...

ADD REPLY
1
Entering edit mode

The default output from agat_convert_sp_gxf2gxf.pl is a GFF file. Please use agat_convert_sp_gff2gtf.pl on the newly created file to make a proper GTF file.

ADD REPLY

Login before adding your answer.

Traffic: 1646 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6