How can I extract protein sequences from a GFF file?
3 months ago

I have a GFF file containing genes predicted by AUGUSTUS. The file already contains CDS, exons, and protein sequences. I need to extract the protein sequences from the file using bash.

sequence bash gff • 428 views

You can use awk. Replace 1 with the column number where your sequences are.

awk '{print $1}' my.GFF

I am a beginner and I don't understand why there is already a sequence in the GFF file. However, I need to extract the sequence itself in order to map it. You can check a screenshot: https://drive.google.com/file/d/1OEw1g0Ayjr7a7yOyPVG0Fsz7rdjmDEPu/view?usp=sharing

Please post the data itself rather than images of the data.

3 months ago

Oh, I see. That output is strange and more complicated, but not impossible to handle. I'm assuming you also want the hash (#) and space removed from the beginning of each protein sequence line. Let me know if this does the trick for you:

awk '/# protein sequence/{a=1}/# Evidence/{a=0}a' Genes.gff | sed 's/# //'

• /# protein sequence/ matches lines containing this text, and /# Evidence/ does the same for its own text.
• /# protein sequence/{a=1} sets the flag when the text # protein sequence is found.
• /# Evidence/{a=0} unsets the flag when the text # Evidence is found.
• The final a is a pattern with the default action, which is to print $0: if the flag equals 1, the line is printed.
• Finally, sed removes the first "# " (hash and space) on each line, which for these comment lines is the one at the beginning of the line.
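
To see the flag technique from the bullets above in action without the real file, here is a minimal sketch that runs the same pipeline on a tiny made-up input. The sample lines only imitate the comment layout described in this thread and are not real AUGUSTUS output. The function names extract_protein_lines and to_fasta, and the protein_N headers, are hypothetical. The second function is an optional extra step, assuming each sequence starts with "protein sequence = [" and ends with "]".

```shell
#!/usr/bin/env bash
# Invented sample that imitates the layout discussed in the thread
# (assumption: not real AUGUSTUS output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr1	AUGUSTUS	gene	1	900	.	+	.	g1
# protein sequence = [MKTAYIAKQR
# QISFVKSHFS]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
chr1	AUGUSTUS	gene	1200	2000	.	+	.	g2
# protein sequence = [MSLLTEVETP]
# Evidence for and against this transcript:
EOF

# Same pipeline as in the answer: the flag a is switched on at
# "# protein sequence", switched off at "# Evidence", and every line
# seen while the flag is on gets printed.
extract_protein_lines() {
    awk '/# protein sequence/{a=1} /# Evidence/{a=0} a' "$1" | sed 's/# //'
}

# Hypothetical follow-up: turn the extracted lines into FASTA records.
# Assumes the "protein sequence = [" prefix and the closing "]".
to_fasta() {
    awk '
        /^protein sequence = \[/ {
            n++
            printf ">protein_%d\n", n
            sub(/^protein sequence = \[/, "")
        }
        { sub(/\]$/, ""); print }
    '
}

result=$(extract_protein_lines "$sample")
fasta=$(printf '%s\n' "$result" | to_fasta)

printf '%s\n' "$result"
printf '%s\n' "$fasta"
rm -f "$sample"
```

If a line could contain "# " somewhere other than its start, sed 's/^# //' anchors the substitution to the beginning of the line.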

Yes it works. Thank you so much!