Question

How to get proteins from GFF file resulted from MAKER annotation

0

Entering edit mode

5.2 years ago

imda ▴ 10

Hi all,

I already have annotated my plant genome using MAKER. I used some scripts from MAKER to recover the annotations in GFF3 format and transcripts and proteins in a fasta file. Nevertheless, I used quality_filter.pl script to clean those genes models that could result as false positive and I created a new GFF3 file to get only the gene models with an AED score of <0.5. Do you know a script to recover the proteins and transcripts from this new GFF3 file with only the good models?

Ivan PhD student UNAM

assembly MAKER genome fasta GFF3 • 11k views

ADD COMMENT • link updated 16 months ago by Juke34 8.6k • written 5.2 years ago by imda ▴ 10

0

Entering edit mode

Thank you for your answer, do you have to install GASS to run the script? or just copy the script?

best

ADD REPLY • link 5.2 years ago by imda ▴ 10

0

Entering edit mode

I am trying and I think all is running good

ADD REPLY • link 5.2 years ago by imda ▴ 10

1

Entering edit mode

Btw as you can run tools from that repo I suggest you use maker_merge_outputs_from_datastore.pl to Merge output from Maker.

ADD REPLY • link 5.2 years ago by Juke34 8.6k

0

Entering edit mode

Hi Juke,

MAKER has a very similar script gff_merge.pl) to merge all the outputs. BTW I have a question, do you know why my proteins sequences have a * at the end of the sequence?

Datura20685-RA gene=Datura_stramonium18417 name=Datura_stramonium18417 seq_id=opera_scaffold_643_pilon type=cds MKCFREVTIQRVAKGTDGPSNGKLSRQMICQTDPGNEDRTIHPSRGSGISSFFSFNIKPR MEKGNGHETKEIIEIPKYITIRRVLGLLMANTDGDEKKKVISLGIGDPTAYSCFRTTDSA KEVVANSLVCGTYDGYAPAVGLPRTREAIADYLSRDIPYSISGNDVYVTAGCTQAIEIVL TILARPGGNILLPRPGFSIYSLCAAFRNLEIRYYDLLPEKGWEVDLTAVETLANENTVGL VVINPGNPCANVYSHQHLEKIAQTAKRLQTIVVADEVYGHLAFGENPFIPMAAFSSLVPV VTLGSLSKRWLVPGWRLGWFVINDPNYIYVNPKAAVPAIIEQTTETFYQKTISLLKQTSN ICYEKIKEIPSLICPCKPQGAMFLMSSSLEEALARVKCFCLRHTKRDGGHIPSIN*

ADD REPLY • link 5.2 years ago by imda ▴ 10

1

Entering edit mode

Try both and you will see the difference ;). The main reason I prefer maker_merge_outputs_from_datastore.pl beside the fact that the result is more complete (It collects all tracks not only the maker annotation, it collects proteins and transcripts too in one go, it backups the .ctl files because we usually lost that information, it computes statistics of the annotation ) the script looks directly in the genome_datastore folder while gff_merge passes by the master_datastore_index.log file. This is important because I have seen many times when using MPI that the master_datastore_index.log file was ´corrupted´, consequently some annotations were missing.

Btw from GAAS you could also use the maker_check_progress.sh tool to be sure that your annotation has really completed successfully. Just launch it in the folder where you have run MAKER.

ADD REPLY • link 5.2 years ago by Juke34 8.6k

0

Entering edit mode

I already have used both scripts, and the script from GAAS is terrific, I got a lot of information. There is a part where it says about the gene coverage percent, in my case the number was 10. Does this mean that the genes cover only 10 % of the total genome size?

Best

ADD REPLY • link 5.2 years ago by imda ▴ 10

1

Entering edit mode

Hi imda, Yes it does.

ADD REPLY • link 5.2 years ago by Juke34 8.6k

0

Entering edit mode

They are stop codons. MAKER include stop codon within the CDS. You could get rid of them by using ‘- -cfs’ option while using the script (stand for: Clean Final Stop).

ADD REPLY • link 5.2 years ago by Juke34 8.6k

score 8 · Accepted Answer · 2019-02-14

8

Entering edit mode

5.2 years ago

Juke34 8.6k

I'm using this one:

agat_sp_extract_sequences.pl from AGAT:

agat_sp_extract_sequences.pl --gff myfFile.gff -f fastaFile.fa -p -o myoutput.fa

ADD COMMENT • link 16 months ago by Juke34 8.6k

0

Entering edit mode

Dear Juke, could you help me to understand this from the script:

-t Define the feature you want to extract the sequnece from. By deafault it's 'cds'. Most common choice are: gene,mrna,exon,cds,trna,three_prime_utr,five_prime_utr. When you chose exon (or cds,utr,etc.), all the exon related to a same L2 feature are attached together before to extract the exon. (It doesnt provide one sequence by exon !!) Is this means that the exons are fragmented or what?

Thank you

ADD REPLY • link 5.2 years ago by imda ▴ 10

0

Entering edit mode

It just means that when you have an mRNA made from 3 exons (multi exons structure are common in eukaryotes), using ‘-t exon’ will not extract the 3 exons independently (as there are 3 exon features in your gff, one by line) but rather recreate the mRNA by concatenating them properly.

ADD REPLY • link 5.2 years ago by Juke34 8.6k

0

Entering edit mode

The fasta file from this run is not compatible with interproscan since it has * and other dots characters which interproscan does not understand/work with

ADD REPLY • link 16 months ago by javanokendo ▴ 60

0

Entering edit mode

Yes you must add the --clean_final_stop parameter or --cfs

ADD REPLY • link 16 months ago by Juke34 8.6k

score 2 · Accepted Answer · 2019-02-15

2

Entering edit mode

5.2 years ago

jean.elbers ★ 1.7k

Here's is a one-liner that requires (seqtk; https://github.com/lh3/seqtk) and pcregrep that first makes the FASTA sequences on multiple lines onto a single line (seqtk seq -l0), then it puts the FASTA header and sequence on the same line separated by a tab (paste - -), next searches for lines with AED between 0.0-0.49 (pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]"), finally splits the FASTA header and sequence into two lines by converting the tab into a new-line (tr '\t' '\n').

Note that --buffer-size is not available in all versions of pcregrep (I am using Ubuntu 16.04, pcregrep version 8.38 2015-11-23), the --buffer size option might be needed for very long transcripts

Here is the complete command

seqtk seq -l0 OryPal1.all.fasta.all.maker.proteins.fasta | paste - - | pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]" |tr '\t' '\n' > OryPal1.all.fasta.all.maker.proteins.0-0.49_AED.fasta

OryPal1.all.fasta.all.maker.proteins.fasta is the output from MAKER's fasta_merge program

edit:

grep -e " AED:0.[0-4][0-9]" also works instead of pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]"

ADD COMMENT • link 5.2 years ago by jean.elbers ★ 1.7k

0

Entering edit mode

Dear Jean I applied your script to get the proteins, and it worked very well, I checked first the number of proteins recovered and the number of genes in my original gff file. I think you can apply this same script to get the transcripts, right?.

Best

ADD REPLY • link 5.2 years ago by imda ▴ 10

1

Entering edit mode

Yes, it works for transcripts also.

ADD REPLY • link 5.2 years ago by jean.elbers ★ 1.7k

0

Entering edit mode

Dear Jean, just a doubt, the script considers the stop codons?

ADD REPLY • link 5.2 years ago by imda ▴ 10

1

Entering edit mode

I don’t know about the stop codons. I guess it depends what seqtk does when parsing the * . I won’t be at a computer to check until Monday.

ADD REPLY • link 5.2 years ago by jean.elbers ★ 1.7k

1

Entering edit mode

I don't have any stop codons in my proteins from MAKER, so I can't see if they are kept with my script, but I did make a test protein FASTA file with an internal stop codon represented by * and my script kept the *. So, yes the script does keep stop codons represented by *.

ADD REPLY • link 5.2 years ago by jean.elbers ★ 1.7k