Question: Reduce the headers in a multi-FASTA sequence file to only the gene name
k.kathirvel93 wrote (3 months ago):

I have multi-FASTA protein sequences with long headers, but I want to discard everything else in each header and keep only the gene name, which appears after 'GN='. Can anyone help with this, please?

sp|P0AGM2|YICG_ECOLI UPF0126 inner membrane protein YicG OS=Escherichia coli (strain K12) OX=83333 GN=yicG PE=1 SV=1

MLLHILYLVGITAEAMTGALAAGRRRMDTFGVIIIATATAIGGGSVRDILLGHYPLGWVK HPEYVIIVATAAVLTTIVAPVMPYLRKVFLVLDALGLVVFSIIGAQVALDMGHGPIIAVV AAVTTGVFGGVLRDMFCKRIPLVFQKELYAGVSFASAVLYIALQHYVSNHDVVIISTLVF GFFARLLALRLKLGLPVFYYSHEGH

sp|P64442|YCEO_ECOLI Uncharacterized protein YceO OS=Escherichia coli (strain K12) OX=83333 GN=yceO PE=1 SV=1

MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF

I want output like this (with the > symbol):

yicG

MLLHILYLVGITAEAMTGALAAGRRRMDTFGVIIIATATAIGGGSVRDILLGHYPLGWVK HPEYVIIVATAAVLTTIVAPVMPYLRKVFLVLDALGLVVFSIIGAQVALDMGHGPIIAVV AAVTTGVFGGVLRDMFCKRIPLVFQKELYAGVSFASAVLYIALQHYVSNHDVVIISTLVF GFFARLLALRLKLGLPVFYYSHEGH

yceO

MRPFLQEYLMRRLLHYLINNIREHLMLYLFLWGLLAIMDLIYVFYF

Note: Sorry, the > symbol was present in all the FASTA headers, but it is not appearing on Biostars (maybe I don't know how to post correctly).

Thanks in advance

you can also try this:

sed '/^>/ s/.*\sGN=\(.*\)\sPE.*/>\1/' test.fa
(comment written 3 months ago by cpad0112)
gayachit wrote (3 months ago):

Very rough solution but should work:

sed -e '/^>/ s/.*GN=/>/' -e '/^>/ s/ PE=.*//' your_file.txt > new_file.txt
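
A slightly more defensive variant, as a rough sketch only: it assumes UniProt-style headers that start with > and carry the gene name in a space-delimited GN= field. The function name rename_headers is made up for illustration. Headers without a GN= field are passed through unchanged rather than mangled, and sequence lines are never touched.

```shell
# Hypothetical helper (name is an assumption, not from the thread).
# Reads FASTA from the given files, or from stdin if no file is given.
rename_headers() {
  awk '
    /^>/ {
      # Look for the GN= field; the gene name runs up to the next space.
      if (match($0, /GN=[^ ]+/)) {
        # RSTART points at "GN=", so skip those 3 characters.
        print ">" substr($0, RSTART + 3, RLENGTH - 3)
      } else {
        # No GN= field in this header: keep it as it is.
        print
      }
      next
    }
    # Sequence lines are printed unchanged.
    { print }
  ' "$@"
}
```

It could then be called as `rename_headers your_file.txt > new_file.txt`, using the same file names as in the sed command above.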