How to rearrange fasta headers
23 months ago

Hello! I'm building a database of a certain gene family. I downloaded the FASTA files from UniProt and concatenated them with cat. The header of each sequence has the following format:

> tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1

I'm performing an alignment with MMseqs2 and I need the gene information (the GN= part) to be the first string after the first pipe sign (|) in each FASTA header. Is there a way to do that using awk or R string manipulation?

I want every FASTA header to have the GN='gene name' part as the first string right after the first pipe sign.

The expected result for each FASTA header is the following:

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 PE=3 SV=1

Thanks for your time

fasta R bash string • 885 views

Check if this works:

$ awk -F '[| ]' '/^>/ {$3=$11"|";$11="";$2=$2"|"; gsub(/\| /,"|",$0)}1' test.fa

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442  PE=3 SV=1
atgc

$ sed -r '/^>/ s/(tr\|)(.*\|)(.*)(GN=\w+)(.*)$/\1\4\|\3\5/' test.fa

> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442  PE=3 SV=1
atgc

On macOS, replace sed with gsed.
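
A note on robustness: the awk one-liner above picks GN= by position ($11), which matches the sample header but shifts when the protein or organism name has a different number of words, and the sed one matches only tr| entries. The sketch below is one possible position-independent variant, not the method from the answer above. The function name move_gn is made up for illustration, and it assumes the gene name contains no spaces and that headers without a GN= field (or without two pipes) should be left untouched.

```shell
# move_gn: hypothetical helper; reads FASTA from stdin or from file arguments.
# Assumption: GN= values contain no spaces.
move_gn() {
  awk '
    /^>/ && match($0, / GN=[^ ]+/) {
      gn = substr($0, RSTART + 1, RLENGTH - 1)        # e.g. "GN=nidA3"
      # header with the " GN=..." token cut out
      stripped = substr($0, 1, RSTART - 1) substr($0, RSTART + RLENGTH)
      p1 = index(stripped, "|")                       # first pipe
      rest = substr(stripped, p1 + 1)
      p2 = index(rest, "|")                           # second pipe, relative to rest
      if (p1 > 0 && p2 > 0)
        $0 = substr(stripped, 1, p1) gn substr(rest, p2)
    }
    { print }                                         # sequence lines pass through
  ' "$@"
}

# Usage (file names are placeholders):
# move_gn test.fa > test.gn.fa
```

It uses only match(), substr() and index(), so it should not depend on gawk-specific extensions.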

23 months ago
JC 13k

Can you explain what exactly you need? To just extract the field, you can use a Perl regex:

$ echo "tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1" | perl -lne 'print $1 if (/GN=(\w+)/)'
nidA3

Thanks for your reply, and sorry if I didn't explain it well. I have 334 sequences in a single FASTA file. The header of each sequence (the line starting with '>') carries the gene identity in the GN= field, but that field appears near the end of the header.

I need the gene information to appear right after the first pipe sign in each of the 334 FASTA headers.
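
Before rewriting anything, it may be worth checking that every header really has a GN= field, since any header-rewriting command has to decide what to do with the ones that lack it. A minimal sketch, with count_gn as a made-up helper name and the same "no spaces in the gene name" assumption:

```shell
# count_gn: hypothetical helper; reads FASTA from stdin or from file arguments
# and reports how many header lines carry a GN= token.
count_gn() {
  awk '
    /^>/ { total++; if ($0 ~ / GN=[^ ]+/) with_gn++ }
    END  { printf "headers=%d with_GN=%d without_GN=%d\n", total, with_gn, total - with_gn }
  ' "$@"
}

# Usage (file name is a placeholder):
# count_gn test.fa
```

For the file described here, headers should come out as 334; a non-zero without_GN would mean some headers need separate handling.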


Thanks, that is clearer. You can do it with:

$ echo "tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1" | perl -pe 'if(/(GN=\w+)/) { $id=$1; s/\|/|$id|/}'
tr|GN=nidA3|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1