Question

How to add "transcript" feature to a gtf file?

3

Entering edit mode

3.2 years ago

pomodoro_sinensis ▴ 110

I have a gtf file that looks as follows (lines for transcript ENSMUST00000027477):

protein_coding  exon    87510093    87510362    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000342435";
1   protein_coding  CDS 87510093    87510199    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  start_codon 87510197    87510199    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "1"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";
1   protein_coding  exon    87509245    87509387    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000358865";
1   protein_coding  CDS 87509245    87509387    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "2"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87503266    87503573    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325521";
1   protein_coding  CDS 87503266    87503573    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "3"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87489584    87489744    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325514";
1   protein_coding  CDS 87489584    87489744    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "4"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87487799    87487951    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157375";
1   protein_coding  CDS 87487799    87487951    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "5"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87486156    87486285    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000157378";
1   protein_coding  CDS 87486156    87486285    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "6"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87484599    87484673    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001254502";
1   protein_coding  CDS 87484599    87484673    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "7"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87482623    87482712    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001289300";
1   protein_coding  CDS 87482623    87482712    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "8"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87481116    87481279    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001251518";
1   protein_coding  CDS 87481116    87481279    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "9"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87480587    87480742    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00001242957";
1   protein_coding  CDS 87480587    87480742    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "10"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87479837    87479916    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325453";
1   protein_coding  CDS 87479837    87479916    .   -   1   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "11"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87479103    87479207    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000325447";
1   protein_coding  CDS 87479103    87479207    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "12"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  exon    87476834    87477744    .   -   .   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; exon_id "ENSMUSE00000511304";
1   protein_coding  CDS 87477557    87477744    .   -   2   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002"; protein_id "ENSMUSP00000027477";
1   protein_coding  stop_codon  87477554    87477556    .   -   0   gene_id "ENSMUSG00000026259"; transcript_id "ENSMUST00000027477"; exon_number "13"; gene_name "Ngef"; gene_biotype "protein_coding"; transcript_name "Ngef-002";

I am not entirely certain, but I think it contains all the information to determine the coordinates of the transcript ENSMUST00000027477 (from the start of exon 1 to the end of the last exon). However, there is no "transcript" line with such information.

I am trying to make a Kallisto index with this gtf file, and it is giving me an error precisely because it is missing the transcript lines:

Exception: The following transcripts have a "exon" feature but no corresponding "transcript" feature(s):

Does anybody know of a way of including such transcript feature? I can imagine adding it "manually" for each transcript using bash commands but I wonder if there is a tool available that performs this.

Thank you!!

gtf • 3.2k views

ADD COMMENT • link updated 9 months ago by 1769mkc ★ 1.2k • written 3.2 years ago by pomodoro_sinensis ▴ 110

4

Entering edit mode

You could try AGAT, which has a function agat_convert_sp_gxf2gxf.pl to try to add missing features to a GTF or GFF3 file. I think example 9 on the main page is similar to what you want to do.

ADD REPLY • link 3.2 years ago by rpolicastro 13k

0

Entering edit mode

Thank you! Yes, that is exactly what I needed.

ADD REPLY • link 3.2 years ago by pomodoro_sinensis ▴ 110

4

Entering edit mode

You could try to run it through one of the script/tools in from AGAT, that might fill in 'missing lines' . I think that agat_convert_sp_gxf2gxf.pl is a good option.

More info also in this post :AGAT - Another Gff Analysis Toolkit

ADD REPLY • link 3.2 years ago by lieven.sterck 15k

0

Entering edit mode

Thank you so much!! If I may, I am going to paste the question I made to rpolicastro on the same topic:

Thank you! Yes, that is exactly what I needed. However the function does not seem to be working for me. Here is the output that I get after executing it:

    Filtering ontology:
        We found 1757 terms that are sequence_feature or is_a child of it.
-------------------------------- parse features --------------------------------
=> GFF version parser used: 2
********************************************************************************
*                               - End parsing -                                *
*                             done in 467 seconds                              *
********************************************************************************

********************************************************************************
*                               - Start checks -                               *
********************************************************************************
---------------------------- Check1: feature types -----------------------------
----------------------------------- ontology -----------------------------------
All feature types in agreement with the Ontology.
------------------------------------- agat -------------------------------------
AGAT can deal with all the encountered feature types (3rd column)
------------------------------ done in 0 seconds -------------------------------

------------------------------ Check2: duplicates ------------------------------
None found
------------------------------ done in 0 seconds -------------------------------

-------------------------- Check3: sequential bucket ---------------------------
Nothing to check as sequential bucket!
------------------------------ done in 0 seconds -------------------------------

--------------------------- Check4: l2 linked to l3 ----------------------------

And then it stops there.... no more output. The gtf looks fine, do you have any experience with this program? Thank you!!

ADD REPLY • link 3.2 years ago by pomodoro_sinensis ▴ 110

0

Entering edit mode

Did it crash? How long dit it run? I recently added some extra check very specific at this blocking step, maybe I added something not efficient at al in term of calcul, and you end up in that case. What version of AGAT do you use ? Could you share your file ?

ADD REPLY • link 3.2 years ago by Juke34 8.5k

0

Entering edit mode

Hi Juke34. It seemed like it wasn't working but it ended up running! It was just that it took very long (~5 hours). Here is the output I got from that section:

--------------------------- Check4: l2 linked to l3 ----------------------------
88226 cases fixed where L3 features have parent feature(s) missing
---------------------------- done in 17232 seconds -----------------------------

I think this step was precisely solving the issue I had: the lack of transcript feature for each group of exons corresponding to a transcript.

Thanks so much for your reply and thank you so much for making this software, it is fantastic!

Edit: the gtf looks fine to me, but it seems like Kallisto cannot use it to make a Kallisto index. Here is a portion of the errors I get:

2021-02-05 11:01:58,563] WARNING Failed to parse GTF attributes of entry: X protein_coding  CDS 57299460    57299661    .   -   0   ID=cds-72867;Parent=ENSMUST00000033468;exon_number=5;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,564] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57302310    57302431    .   -   2   ID=cds-72866;Parent=ENSMUST00000033468;exon_number=4;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,565] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57304566    57304650    .   -   0   ID=cds-72865;Parent=ENSMUST00000033468;exon_number=3;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,565] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57336952    57337035    .   -   0   ID=cds-72864;Parent=ENSMUST00000033468;exon_number=2;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

[2021-02-05 11:01:58,566] WARNING Failed to parse GTF attributes of entry: X    protein_coding  CDS 57338338    57338574    .   -   0   ID=cds-72863;Parent=ENSMUST00000033468;exon_number=1;gene_biotype=protein_coding;gene_id=ENSMUSG00000031133;gene_name=Arhgef6;protein_id=ENSMUSP00000033468;transcript_id=ENSMUST00000033468;transcript_name=Arhgef6-001

I assume the conversion your software makes is not compatible with kb ref?

Edit2: I found this link that mentions the gtf format kb ref needs:

https://github.com/pachterlab/kb_python/issues/48

I quote: "Specifically, kb looks for gene_id gene_name transcript_id GTF attributes, and these attributes must be semicolon-separated list of tag-value pairs separated by a single space."

I observed that the pairs are separated by "=", so I ran the following command:

sed 's/=/ /g' mm10.converted.gtf > mm10.converted_equal.gtf

Here is an example of the new lines:

MG4209_PATCH    protein_coding  start_codon 45949669    45949671    .   +   0   ID start_codon-40939;Parent ENSMUST00000180606;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097528;gene_name AC233742.1;transcript_id ENSMUST00000180606;transcript_name AC233742.1-201
MG4209_PATCH    protein_coding  gene    46156520    46184432    .   -   .   ID nbis-gene-22893;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;protein_id ENSMUSP00000137714;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  mRNA    46156520    46184432    .   -   .   ID ENSMUST00000180451;Parent nbis-gene-22893;exon_number 1;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;protein_id ENSMUSP00000137714;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  exon    46156520    46156586    .   -   .   ID ENSMUSE00001120372;Parent ENSMUST00000180451;exon_id ENSMUSE00001120372;exon_number 7;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201
MG4209_PATCH    protein_coding  exon    46157336    46157453    .   -   .   ID ENSMUSE00001098262;Parent ENSMUST00000180451;exon_id ENSMUSE00001098262;exon_number 6;gene_biotype protein_coding;gene_id ENSMUSG00000097238;gene_name AC212990.1;transcript_id ENSMUST00000180451;transcript_name AC212990.1-201

PS: I know I need to change "mRNA" by "transcript" according to the original error message of this post, but this doesn't seem to be the origin of this issue...

ADD REPLY • link 3.2 years ago by pomodoro_sinensis ▴ 110

2

Entering edit mode

The default output from agat_convert_sp_gxf2gxf.pl is a GFF file. Please use agat_convert_sp_gff2gtf.pl on the newly created file to make a proper GTF file.

ADD REPLY • link 3.2 years ago by Juke34 8.5k

0

Entering edit mode

can this be used to add entrez ID feature to existing gtf file which don;t have entrez ID ?

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k

2

Entering edit mode

Do you talk about AGAT? AGAT is not designed to retrieve entrez ID. If you have them in a tsv file with the corresponding features ID ( the same as hold in you GTF/GFF file), then yes AGAT can attach information.

ADD REPLY • link 9 months ago by Juke34 8.5k

0

Entering edit mode

yes I have made tsv file where I have entrez ID and then ensembl ID for respective entrezID. . In the gtf file it looks something like this

    #!genome-build ROS_Cfam_1.0
#!genome-version ROS_Cfam_1.0
#!genome-date 2020-09
#!genome-build-accession GCA_014441545.1
#!genebuild-last-updated 2020-10
X       ensembl gene    24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
X       ensembl transcript      24550462        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
X       ensembl exon    24552206        24552226        .       -       .       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSCAFE00845128634"; exon_version "1"; tag "Ensembl_canonical";
X       ensembl CDS     24552206        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSCAFP00845021332"; protein_version "1"; tag "Ensembl_canonical";
X       ensembl start_codon     24552224        24552226        .       -       0       gene_id "ENSCAFG00845015183"; gene_version "1"; transcript_id "ENSCAFT00845027108"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

and My reference file that holds the feature mean entrezID which I want to add looks like this

entrezID   dbXrefs
399518   VGNC:VGNC:43937|Ensembl:ENSCAFG00845006432
399530   VGNC:VGNC:40330|Ensembl:ENSCAFG00845002136
399544   VGNC:VGNC:48329|Ensembl:ENSCAFG00845029798
399545   VGNC:VGNC:41624|Ensembl:ENSCAFG00845011460
399653   VGNC:VGNC:44358|Ensembl:ENSCAFG00845001610
403152   -
403157   VGNC:VGNC:38980|Ensembl:ENSCAFG00845013158
403168   VGNC:VGNC:37699|Ensembl:ENSCAFG00845014982
403169   VGNC:VGNC:54270

Will this format work with AGAT ?

Is this the resource

https://agat.readthedocs.io/en/latest/tools/agat_sq_add_attributes_from_tsv.html

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k

1

Entering edit mode

agat_sq_add_attributes_from_tsv.pl can help but you will need the gene_id in the ID column as explainer here:
https://github.com/NBISweden/AGAT/blob/f564798d63a3a0f326fe81f22533b92cc31ecddf/bin/agat_sq_add_attributes_from_tsv.pl#L149C1-L162

ADD REPLY • link 9 months ago by Juke34 8.5k

0

Entering edit mode

"the gene_id in the ID column as explainer here" bit confused about this, so I have to make my reference file as shown in above link?

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k

2

Entering edit mode

You must add as first column of your tsv file the unique identifier used to identify the features in your gtf which are the values attached to the gene_id attributes.

ADD REPLY • link 9 months ago by Juke34 8.5k

0

Entering edit mode

Going by the example in the agat_sq_add code for the input tsv it is as such

* input.tsv:
ID  annot_type1
gene1   anot_x
cds1    anot_y

Now In my case how should I do , since I want to put the entrezID into the gtf file which are unique identifier at the same time it should match the respective "ENSCAFG" for each entrez id in the gtf file. So i should change my reference file 2nd column to the first column and name it as gene_id to match the gtf. That should work I suppose

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k

2

Entering edit mode

Right but you should also remove VGNC:VGNC:43937|Ensembl: extra info and keep only the identifier e.g. ENSCAFG00845006432

ADD REPLY • link 9 months ago by Juke34 8.5k

0

Entering edit mode

thank you for the clarification

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k

0

Entering edit mode

I created the file as you suggested now I do see there are rows which didn't have entrez ID mapping so basically those lines are empty. So I should also filter those empty rows to be the right input?

ADD REPLY • link 9 months ago by 1769mkc ★ 1.2k