Reduce headers in a multi-FASTA sequence file to only the gene name
4.1 years ago
k.kathirvel93 ▴ 300

I have multi-FASTA protein sequences with long headers, but I want to remove everything else from each header and keep only the gene name, which appears after 'GN='. Can anyone help with this, please?

sp|P0AGM2|YICG_ECOLI UPF0126 inner membrane protein YicG OS=Escherichia coli (strain K12) OX=83333 GN=yicG PE=1 SV=1

MLLHILYLVGITAEAMTGALAAGRRRMDTFGVIIIATATAIGGGSVRDILLGHYPLGWVK HPEYVIIVATAAVLTTIVAPVMPYLRKVFLVLDALGLVVFSIIGAQVALDMGHGPIIAVV AAVTTGVFGGVLRDMFCKRIPLVFQKELYAGVSFASAVLYIALQHYVSNHDVVIISTLVF GFFARLLALRLKLGLPVFYYSHEGH

sp|P64442|YCEO_ECOLI Uncharacterized protein YceO OS=Escherichia coli (strain K12) OX=83333 GN=yceO PE=1 SV=1

MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF

I want output like this (with the > symbol):

yicG

MLLHILYLVGITAEAMTGALAAGRRRMDTFGVIIIATATAIGGGSVRDILLGHYPLGWVK HPEYVIIVATAAVLTTIVAPVMPYLRKVFLVLDALGLVVFSIIGAQVALDMGHGPIIAVV AAVTTGVFGGVLRDMFCKRIPLVFQKELYAGVSFASAVLYIALQHYVSNHDVVIISTLVF GFFARLLALRLKLGLPVFYYSHEGH

yceO

MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF

Note: Sorry, the > symbol was present in all the FASTA headers, but it is not appearing on Biostars (maybe I don't know how to post it).

Thanks in advance

genome alignment sequencing sequence gene • 1.6k views

You can also try this:

sed  '/^>/ s/.*\sGN=\(.*\)\sPE.*/>\1/g' test.fa
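
Note that `\s` is a GNU sed extension. A minimal sketch of a more portable variant, assuming POSIX character classes are available, might look like the following. The function name `gn_headers` is hypothetical, and the sample record is the second one from the question with the `>` restored.

```shell
# gn_headers is a hypothetical helper name, not part of any tool.
gn_headers() {
  # On header lines only, keep the text between " GN=" and the next whitespace.
  sed '/^>/ s/.*[[:space:]]GN=\([^[:space:]]*\).*/>\1/' "$@"
}

# Try it on one record from the question (reads stdin when no file is given).
printf '%s\n' \
  '>sp|P64442|YCEO_ECOLI Uncharacterized protein YceO OS=Escherichia coli (strain K12) OX=83333 GN=yceO PE=1 SV=1' \
  'MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF' | gn_headers
```

With a file, the same function would be called as `gn_headers test.fa`. Header lines that have no `GN=` field are left unchanged.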
4.1 years ago
gayachit ▴ 200

A very rough solution, but it should work:

sed -e 's/.*GN=/>/' -e 's/ *PE=.*//' your_file.txt > new_file.txt
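
If some headers might lack a `GN=` field, or only header lines should ever be touched, the same idea could be sketched in awk. This is an alternative under stated assumptions, not the command above: `gn_only` is a hypothetical function name, headers are assumed to start with `>`, and the gene name is assumed to sit in a whitespace-separated `GN=...` field.

```shell
# gn_only is a hypothetical helper name used only for illustration.
gn_only() {
  awk '
    /^>/ {
      # Scan the whitespace-separated fields for the one starting with GN=.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^GN=/) {
          print ">" substr($i, 4)   # drop the leading "GN="
          next
        }
      }
    }
    { print }   # sequence lines, and headers with no GN= field, pass through
  ' "$@"
}

# Small demo: one header with a GN= field and one made-up header without it.
printf '%s\n' \
  '>sp|P0AGM2|YICG_ECOLI UPF0126 inner membrane protein YicG OS=Escherichia coli (strain K12) OX=83333 GN=yicG PE=1 SV=1' \
  'MLLHILYLVGITAEAMTGALAAGRRRMDTFGVIIIATATAIGGGSVRDILLGHYPLGWVK' \
  '>example_without_gene_field' \
  'MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF' | gn_only
```

On a real file this would be run as `gn_only your_file.txt > new_file.txt`.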