Something wrong with my Exonerate gff
2.0 years ago
Mike ▴ 10

Hey all, I used the following command to align my proteome to the genome:

exonerate --model protein2genome --showvulgar no --showalignment no --showtargetgff yes --percent 70 protein genome > output.gff

The gene_id value is generated without quotation marks around it, so I wrote an awk script to add them, but I still run into problems when I try to use my GFF.

For example when I try to convert it to fasta with TransDecoder I get the error:

Use of uninitialized value $type in string eq at Error, no gene_id at Chr03 exonerate:protein2genome:local exon 10813108 10813326 . + . insertions 0 ; deletions 0 ; identity 89.04 ;

There's a gene id only when the record is of type gene.

Another example is when I try to convert to genepred format using ucsc tool:

gtfToGenePred -allErrors gtf_input output.gp

I get:

Word count less than 8 Bad line 1 of file.gtf:

I also tried switching between GTF and GFF and rerunning.

Part of my gtf file:

Chr03 exonerate:protein2genome:local gene 18514887 18517034 2101 - . gene_id "1"   sequence sp|P12459|TBB1_SOYBN   gene_orientation +   identity 88.84   similarity 95.22
Chr03   exonerate:protein2genome:local  cds     18516641        18517034        .       -       .
Chr03   exonerate:protein2genome:local  exon    18516641        18517034        .       -       .       insertions 0 ; deletions 0 ; identity 86.26 ; similarity 93.13
Chr03   exonerate:protein2genome:local  splice5 18516639        18516640        .       -       .       intron_id 1 ; splice_site "GT"
Chr03   exonerate:protein2genome:local  intron  18515928        18516640        .       -       .       intron_id 1
Chr03   exonerate:protein2genome:local  splice3 18515928        18515929        .       -       .       intron_id 0 ; splice_site "AG"
Chr03   exonerate:protein2genome:local  cds     18515658        18515927        .       -       .
Chr03   exonerate:protein2genome:local  exon    18515658        18515927        .       -       .       insertions 0 ; deletions 0 ; identity 93.26 ; similarity 95.51
Chr03   exonerate:protein2genome:local  splice5 18515656        18515657        .       -       .       intron_id 2 ; splice_site "GT"
Chr03   exonerate:protein2genome:local  intron  18515546        18515657        .       -       .       intron_id 2
Chr03   exonerate:protein2genome:local  splice3 18515546        18515547        .       -       .       intron_id 1 ; splice_site "AG"
Chr03   exonerate:protein2genome:local  cds     18514887        18515545        .       -       .
Chr03   exonerate:protein2genome:local  exon    18514887        18515545        .       -       .       insertions 0 ; deletions 0 ; identity 88.58 ; similarity 96.35
Chr03   exonerate:protein2genome:local  similarity      18514887        18517034        2101    -       .       alignment_id 1 ; Query sp|P12459|TBB1_SOYBN ; Align 18517035 1 393 ; Align 18515926 133 267 ; Alig

Any kind of help is highly appreciated.

exonerate genepred gff • 1.8k views

You should post part of your GFF file, as the problem seems to come from there.


Thank you for the reminder and for the wonderful work with the website. I updated the post 🤞🏻


Consider posting it as inline text, not a screenshot.


Thanks for the suggestion, I updated the post 🤞🏻

2.0 years ago
Malcolm.Cook ★ 1.5k

Your problem is that somehow the tabs in your GFF got converted to spaces.

I've used exonerate --showtargetgff previously with success, so I don't think the error lies there.

I suspect your awk script, which you don't show, is doing the aberrant mangling.
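One way to test the tabs-versus-spaces suspicion is to count tab-separated columns on every line. The following is a minimal sketch in Python, not part of exonerate or any of the tools mentioned; the function name `check_columns` and the threshold of 8 columns (taken from the "Word count less than 8" message) are assumptions made for illustration.

```python
def check_columns(lines, min_cols=8):
    """Report lines that do not have at least `min_cols` tab-separated columns.

    Returns a list of (line_number, n_columns, hint) tuples.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        n_cols = len(line.split("\t"))
        if n_cols < min_cols:
            # Enough whitespace-separated words but too few tab-separated
            # columns suggests that tabs were replaced by spaces.
            if len(line.split()) >= min_cols:
                hint = "tabs may have become spaces"
            else:
                hint = "not a feature line"
            problems.append((number, n_cols, hint))
    return problems
```

Calling `check_columns(open("output.gff"))` on the raw exonerate output and on the awk-processed file separately should show which step, if any, loses the tabs.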


Unfortunately, I get the exact same errors when I use my exonerate output without the awk manipulation.

Any idea what I can do in this situation?

Thanks a lot.


Hmm. Do you have another way of giving me a copy of the untouched exonerate output? FTP? Google Doc?


I'm not sure how to upload the GTF to Google Docs, but I can upload it to Drive:

https://drive.google.com/file/d/1SbK7gK9k50qjgS809cvsPclwayYWPL5D/view?usp=sharing

Thank you so much for the help.

2.0 years ago
Juke34 8.5k

The GTF file you show here is wrong: no transcript_id or gene_id is present for non-gene features. The GFF format does not require the quotes: attribute values do not need to be, and should not be, quoted (see here). The gene feature line is weird: there are no separators between the different attributes.
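As a stopgap for the missing gene_id, the ID on each gene line could be copied onto the features that follow it. The following is a minimal sketch in Python under several assumptions: records are tab-separated, each gene line precedes its own features (as in the excerpt), and the helper name `add_gene_id` is hypothetical. It does not add transcript_id, and whether a given downstream tool accepts the result has not been checked here.

```python
def add_gene_id(lines):
    """Copy the gene_id of each 'gene' record onto the records after it."""
    current_id = None
    out = []
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass comments and non-record lines through unchanged.
        if line.startswith("#") or len(cols) < 8:
            out.append(line)
            continue
        # Lines such as 'cds' may lack the attribute column; pad to 9.
        cols += [""] * (9 - len(cols))
        attrs = cols[8].strip()
        if cols[2] == "gene":
            # Attributes look like: gene_id 1 ; sequence ... ; ...
            for field in attrs.split(";"):
                parts = field.split()
                if len(parts) >= 2 and parts[0] == "gene_id":
                    # Keep the value exactly as written (quoted or not).
                    current_id = parts[1]
        elif current_id is not None and "gene_id" not in attrs:
            tag = f"gene_id {current_id}"
            attrs = f"{tag} ; {attrs}" if attrs else tag
        cols[8] = attrs
        out.append("\t".join(cols))
    return out
```

With this, an exon line following `gene_id 1` would start its attribute column with `gene_id 1 ; insertions 0 ; ...`, while the gene line itself is left as it was.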

You can try AGAT to standardize your file.
(I checked with your file and it seems to work fine. You must comment out the first 2 lines, which are not GFF/GTF compliant.)
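Commenting out the non-compliant lines can be done by hand or with a few lines of code. The following is a minimal sketch in Python; the helper name `comment_out_non_records` is hypothetical, and the rule it applies (any non-empty line with fewer than 8 tab-separated columns is treated as non-compliant, wherever it occurs in the file) is an assumption, not something AGAT prescribes.

```python
def comment_out_non_records(lines, min_cols=8):
    """Prefix '# ' to lines that do not look like tab-separated records."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        is_record = len(line.split("\t")) >= min_cols
        # Leave blank lines, existing comments, and records untouched.
        if line and not line.startswith("#") and not is_record:
            line = "# " + line
        out.append(line)
    return out
```

The cleaned lines can then be written to a new file and passed to AGAT instead of the raw output.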


Sorry for the mixup, the exonerate output looks like this:

Chr03   exonerate:protein2genome:local  gene    18514986        18517034        2078    -       .       gene_id 1 ; sequence sp|P28551|TBB3_SOYBN ; gene_orientation + ; identity 96.55 ; similarity 98.52
Chr03   exonerate:protein2genome:local  cds     18516641        18517034        .       -       .
Chr03   exonerate:protein2genome:local  exon    18516641        18517034        .       -       .       insertions 0 ; deletions 0 ; identity 94.66 ; similarity 96.95
Chr03   exonerate:protein2genome:local  splice5 18516639        18516640        .       -       .       intron_id 1 ; splice_site "GT"
Chr03   exonerate:protein2genome:local  intron  18515928        18516640        .       -       .       intron_id 1
Chr03   exonerate:protein2genome:local  splice3 18515928        18515929        .       -       .       intron_id 0 ; splice_site "AG"
Chr03   exonerate:protein2genome:local  cds     18515658        18515927        .       -       .
Chr03   exonerate:protein2genome:local  exon    18515658        18515927        .       -       .       insertions 0 ; deletions 0 ; identity 97.75 ; similarity 98.88
Chr03   exonerate:protein2genome:local  splice5 18515656        18515657        .       -       .       intron_id 2 ; splice_site "GT"
Chr03   exonerate:protein2genome:local  intron  18515546        18515657        .       -       .       intron_id 2
Chr03   exonerate:protein2genome:local  splice3 18515546        18515547        .       -       .       intron_id 1 ; splice_site "AG"

The one in the post is an output of my awk script that adds " " to the gene id.

So as you can see, there are separators in all lines, but I get the same errors.

When you say "I checked with your file and it sounds to work fine", what do you mean exactly? That it works with AGAT, or with the tools I mentioned?


I tried with AGAT: agat_convert_sp_gxf2gxf.pl --gff exo_out.gtf
There are a lot of warnings because the file is unusual, but AGAT handles it well.


Thanks Juke!

I'll give it a shot once I have AGAT installed; hopefully it will let me use these tools.
