Question

How to rearrange fasta headers

23 months ago

v.berriosfarias ▴ 140

Hello! I'm building a database for a certain gene family. I downloaded the FASTA files from UniProt and concatenated them with cat. The header of each sequence has the following format:

> tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1

I'm performing an alignment with mmseqs2, and I need the gene information (the GN= part) to be the first string after the first pipe sign (|) in each FASTA header. Is there a way to do that using awk or R string manipulation?

In other words, every FASTA header should have the GN=<gene name> part immediately after the first pipe sign.

The expected result for each FASTA header is the following:

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 PE=3 SV=1

Thanks for your time

fasta R bash string • 885 views

ADD COMMENT • link updated 23 months ago by cpad0112 21k • written 23 months ago by v.berriosfarias ▴ 140

Check if this works:

$ awk -F '[| ]' '/^>/ {$3=$11"|";$11="";$2=$2"|"; gsub(/\| /,"|",$0)}1' test.fa

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442  PE=3 SV=1
atgc

$ sed -r '/^>/ s/(tr\|)(.*\|)(.*)(GN=\w+)(.*)$/\1\4\|\3\5/' test.fa

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442  PE=3 SV=1
atgc

On macOS, replace sed with gsed (GNU sed).
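
Note that the awk command above picks the GN= field by position ($11), so it only works when a header has the same number of pipe- and space-separated fields as the example (an organism name with more or fewer words shifts the field). A minimal sketch that instead finds GN= wherever it occurs, assuming the GN value contains no spaces; move_gn is a hypothetical helper name, and headers without a GN= field are passed through unchanged:

```shell
# Hypothetical helper: move the GN= field to just after the first pipe,
# replacing the accession, regardless of where GN= sits in the header.
move_gn() {
  awk '
    /^>/ && match($0, / GN=[^ ]+/) {
      gn = substr($0, RSTART + 1, RLENGTH - 1)                     # the GN=... token without its leading space
      $0 = substr($0, 1, RSTART - 1) substr($0, RSTART + RLENGTH)  # remove the token from its original place
      sub(/\|[^|]*\|/, "|" gn "|")                                 # swap the accession between the first two pipes
    }
    { print }                                                      # sequence lines and GN-less headers print as-is
  ' "$@"
}

# Demo on the header from the question
printf '%s\n' \
  '> tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1' \
  'atgc' | move_gn
# > tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 PE=3 SV=1
# atgc
```

On a real file this would be called as move_gn test.fa > renamed.fa (renamed.fa is just an example output name). Comparing grep -c '^>' test.fa with grep -c '^>.*GN=' test.fa beforehand shows whether any headers lack a GN= field.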

ADD REPLY • link 23 months ago by cpad0112 21k

score 1 · Answer 1 · 2022-05-17

23 months ago

JC 13k

Can you explain what exactly you need? To just extract the field, you can use a Perl regex:

$ echo "tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1" | perl -lne 'print $1 if (/GN=(\w+)/)'
nidA3

ADD COMMENT • link 23 months ago by JC 13k

Thanks for your reply, and sorry if I didn't explain it well. The issue is that I have 334 sequences in a single FASTA file. The header of each sequence (the line starting with '>') carries information about the sequence's gene identity. That information is given by the GN field, but it appears almost at the end of the header.

What I need is for the gene information to appear just after the first pipe sign in each of the 334 FASTA headers.

ADD REPLY • link 23 months ago by v.berriosfarias ▴ 140

Thanks, that is clearer. You can do it with:

$ echo "tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1" | perl -pe 'if(/(GN=\w+)/) { $id=$1; s/\|/|$id|/}'
tr|GN=nidA3|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1

ADD REPLY • link 23 months ago by JC 13k