How can I extract protein sequences from a GFF file?
2.9 years ago

I have a GFF file containing genes predicted by AUGUSTUS. The file already contains the CDS, exon, and protein sequence records. I need to extract the protein sequences from the file using bash.

sequence bash gff • 1.7k views

You can use awk. Replace 1 with the column number where your sequences are.

awk '{print $1}' my.GFF
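
GFF feature lines are tab-separated and comment lines start with #, so a column-based approach usually needs a tab delimiter and a filter for comments. Here is a minimal sketch on made-up data (the two feature lines and the column choice are illustrative assumptions, not taken from the poster's file):

```shell
# Made-up sample: one comment line and two tab-separated feature lines.
sample=$(printf '# comment line\nchr1\tAUGUSTUS\tgene\t1\t300\t0.9\t+\t.\tg1\nchr1\tAUGUSTUS\tCDS\t1\t300\t0.9\t+\t0\tg1.t1\n')

# -F'\t' splits on tabs only; !/^#/ skips comment lines; $3 is the feature type.
types=$(printf '%s\n' "$sample" | awk -F'\t' '!/^#/ { print $3 }')
printf '%s\n' "$types"
```

Without -F'\t', awk splits on any whitespace, which breaks up the ninth (attributes) column whenever it contains spaces.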

I am a beginner, and I don't understand why there is already a sequence in the GFF file. However, I need to extract the sequence itself in order to map it. You can check a screenshot here: https://drive.google.com/file/d/1OEw1g0Ayjr7a7yOyPVG0Fsz7rdjmDEPu/view?usp=sharing


Please do not post images of the data; post the data itself as text.

2.9 years ago

Oh, I see. Strange, and more complicated output. Not impossible though. I'm assuming you want the hash (#) and space removed from the beginning of each protein sequence as well.

Let me know if this does the trick for you:

awk '/# protein sequence/{a=1}/# Evidence/{a=0}a' Genes.gff | sed 's/^# //'
  • /# protein sequence/ and /# Evidence/ are patterns that match lines containing that text.
  • /# protein sequence/{a=1} sets the flag a when a line containing # protein sequence is found.
  • /# Evidence/{a=0} unsets the flag when a line containing # Evidence is found.
  • The final a is a pattern with the default action, which is to print $0: while the flag equals 1, the line is printed. Because the flag is cleared before this pattern is tested, the # Evidence line itself is not printed.
  • Finally, sed removes the hash and space from the beginning of each line.
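
The one-liner above keeps the "protein sequence = [" label and the closing bracket, and leaves each protein split across several lines. If FASTA-like output is wanted, the same flag idea can be extended as in the sketch below. It assumes each block starts with "# protein sequence = [" and ends with a line whose last character is "]"; the extract_proteins name, the protein_N headers, and the sample data are made up for illustration.

```shell
# Hypothetical helper: reads AUGUSTUS-style GFF from the given files or stdin.
extract_proteins() {
  awk '
    /^# protein sequence = \[/ { collecting = 1; seq = "" }   # block starts
    collecting {
        line = $0
        sub(/^# /, "", line)                      # drop the leading "# "
        sub(/^protein sequence = \[/, "", line)   # drop the label on the first line
        if (sub(/\]$/, "", line)) {               # closing bracket ends the block
            collecting = 0
            print ">protein_" (++n)               # made-up FASTA header
            print seq line
        } else {
            seq = seq line                        # keep joining the pieces
        }
    }
  ' "$@"
}

# Made-up sample imitating the comment layout discussed in this thread.
proteins=$(printf '%s\n' \
  '# start gene g1' \
  '# protein sequence = [MSTNPKPQRK' \
  '# TKRNTNRRPQ]' \
  '# Evidence for and against this transcript:' \
  '# start gene g2' \
  '# protein sequence = [MKVLAA]' \
  '# Evidence for and against this transcript:' | extract_proteins)
printf '%s\n' "$proteins"
```

On a real file this would be called as extract_proteins Genes.gff > proteins.fa, where proteins.fa is just an example output name.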

Yes it works. Thank you so much!
