Question: How to get proteins from GFF file resulted from MAKER annotation
0
gravatar for imda
5 weeks ago by
imda0
imda0 wrote:

Hi all,

I already have annotated my plant genome using MAKER. I used some scripts from MAKER to recover the annotations in GFF3 format and transcripts and proteins in a fasta file. Nevertheless, I used quality_filter.pl script to clean those genes models that could result as false positive and I created a new GFF3 file to get only the gene models with an AED score of <0.5. Do you know a script to recover the proteins and transcripts from this new GFF3 file with only the good models?

Ivan PhD student UNAM

fasta gff3 maker assembly genome • 237 views
ADD COMMENTlink modified 5 weeks ago by jean.elbers740 • written 5 weeks ago by imda0

Thank you for your answer, do you have to install GASS to run the script? or just copy the script?

best

ADD REPLYlink written 5 weeks ago by imda0

I am trying and I think all is running good

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by imda0
1

Btw as you can run tools from that repo I suggest you use maker_merge_outputs_from_datastore.pl to Merge output from Maker.

ADD REPLYlink written 5 weeks ago by Juke-342.0k

Hi Juke,

MAKER has a very similar script gff_merge.pl) to merge all the outputs. BTW I have a question, do you know why my proteins sequences have a * at the end of the sequence?

Datura20685-RA gene=Datura_stramonium18417 name=Datura_stramonium18417 seq_id=opera_scaffold_643_pilon type=cds MKCFREVTIQRVAKGTDGPSNGKLSRQMICQTDPGNEDRTIHPSRGSGISSFFSFNIKPR MEKGNGHETKEIIEIPKYITIRRVLGLLMANTDGDEKKKVISLGIGDPTAYSCFRTTDSA KEVVANSLVCGTYDGYAPAVGLPRTREAIADYLSRDIPYSISGNDVYVTAGCTQAIEIVL TILARPGGNILLPRPGFSIYSLCAAFRNLEIRYYDLLPEKGWEVDLTAVETLANENTVGL VVINPGNPCANVYSHQHLEKIAQTAKRLQTIVVADEVYGHLAFGENPFIPMAAFSSLVPV VTLGSLSKRWLVPGWRLGWFVINDPNYIYVNPKAAVPAIIEQTTETFYQKTISLLKQTSN ICYEKIKEIPSLICPCKPQGAMFLMSSSLEEALARVKCFCLRHTKRDGGHIPSIN*

ADD REPLYlink written 4 weeks ago by imda0
1

Try both and you will see the difference ;). The main reason I prefer maker_merge_outputs_from_datastore.pl beside the fact that the result is more complete (It collects all tracks not only the maker annotation, it collects proteins and transcripts too in one go, it backups the .ctl files because we usually lost that information, it computes statistics of the annotation ) the script looks directly in the genome_datastore folder while gff_merge passes by the master_datastore_index.log file. This is important because I have seen many times when using MPI that the master_datastore_index.log file was ´corrupted´, consequently some annotations were missing.

Btw from GAAS you could also use the maker_check_progress.sh tool to be sure that your annotation has really completed successfully. Just launch it in the folder where you have run MAKER.

ADD REPLYlink written 4 weeks ago by Juke-342.0k

I already have used both scripts, and the script from GAAS is terrific, I got a lot of information. There is a part where it says about the gene coverage percent, in my case the number was 10. Does this mean that the genes cover only 10 % of the total genome size?

Best

ADD REPLYlink written 4 weeks ago by imda0
1

Hi imda, Yes it does.

ADD REPLYlink written 4 weeks ago by Juke-342.0k

They are stop codons. MAKER include stop codon within the CDS. You could get rid of them by using ‘- -cfs’ option while using the script (stand for: Clean Final Stop).

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Juke-342.0k
2
gravatar for Juke-34
5 weeks ago by
Juke-342.0k
Sweden
Juke-342.0k wrote:

I'm using this one: gff3_sp_extract_sequences.pl from the GAAS repository.

gff3_sp_extract_sequences.pl -gff myfFile.gff -f fastaFile -p -o myoutput.fa

ADD COMMENTlink written 5 weeks ago by Juke-342.0k

Dear Juke, could you help me to understand this from the script:

-t Define the feature you want to extract the sequnece from. By deafault it's 'cds'. Most common choice are: gene,mrna,exon,cds,trna,three_prime_utr,five_prime_utr. When you chose exon (or cds,utr,etc.), all the exon related to a same L2 feature are attached together before to extract the exon. (It doesnt provide one sequence by exon !!) Is this means that the exons are fragmented or what?

Thank you

ADD REPLYlink written 5 weeks ago by imda0

It just means that when you have an mRNA made from 3 exons (multi exons structure are common in eukaryotes), using ‘-t exon’ will not extract the 3 exons independently (as there are 3 exon features in your gff, one by line) but rather recreate the mRNA by concatenating them properly.

ADD REPLYlink modified 4 weeks ago • written 5 weeks ago by Juke-342.0k
2
gravatar for jean.elbers
5 weeks ago by
jean.elbers740
jean.elbers740 wrote:

Here's is a one-liner that requires (seqtk; https://github.com/lh3/seqtk) and pcregrep that first makes the FASTA sequences on multiple lines onto a single line (seqtk seq -l0), then it puts the FASTA header and sequence on the same line separated by a tab (paste - -), next searches for lines with AED between 0.0-0.49 (pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]"), finally splits the FASTA header and sequence into two lines by converting the tab into a new-line (tr '\t' '\n').

Note that --buffer-size is not available in all versions of pcregrep (I am using Ubuntu 16.04, pcregrep version 8.38 2015-11-23), the --buffer size option might be needed for very long transcripts

Here is the complete command

seqtk seq -l0 OryPal1.all.fasta.all.maker.proteins.fasta | paste - - | pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]" |tr '\t' '\n' > OryPal1.all.fasta.all.maker.proteins.0-0.49_AED.fasta

OryPal1.all.fasta.all.maker.proteins.fasta is the output from MAKER's fasta_merge program

edit:

grep -e " AED:0.[0-4][0-9]" also works instead of pcregrep --buffer-size 3000000000 " AED:0.[0-4][0-9]"

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by jean.elbers740

Dear Jean I applied your script to get the proteins, and it worked very well, I checked first the number of proteins recovered and the number of genes in my original gff file. I think you can apply this same script to get the transcripts, right?.

Best

ADD REPLYlink written 4 weeks ago by imda0
1

Yes, it works for transcripts also.

ADD REPLYlink written 4 weeks ago by jean.elbers740

Dear Jean, just a doubt, the script considers the stop codons?

ADD REPLYlink written 4 weeks ago by imda0
1

I don’t know about the stop codons. I guess it depends what seqtk does when parsing the * . I won’t be at a computer to check until Monday.

ADD REPLYlink written 4 weeks ago by jean.elbers740
1

I don't have any stop codons in my proteins from MAKER, so I can't see if they are kept with my script, but I did make a test protein FASTA file with an internal stop codon represented by * and my script kept the *. So, yes the script does keep stop codons represented by *.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by jean.elbers740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1993 users visited in the last hour