How to extract organism names from the headers in a multi-FASTA file?
1
1
2.8 years ago
MB ▴ 30

Hi, I am trying to extract organisms' names from the headers in a multi-fasta file named input.fa shown below:

>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK


What I want to obtain is the organism name in each header. I have tried the following sed command, but I am unable to check for alphanumerics, so I also get the digits after the first underscore, as in the third header.

sed -eT -e 's|_|&\n|;D' input.fa > out.txt


Expected results:

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla


Please tell me how to obtain org names only. Thanks!

Fasta Regex Sed Header • 1.5k views
2

with sed:

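$ sed -rn 's/.*(org\.*)/\1/pgi' test.txt

(or)

$ sed -n '/>/ s/.*[0-9]_//p' test.txt

The output below is from the first command, which keeps everything from the last "org" onward and so shortens the third header to org_name_bla; the second command prints Other_org_name_bla for that header.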

Org_name_nam_strain.11
Org_other_name_bla_bla
org_name_bla


with awk:

$ awk '/>/ {sub(".*[0-9]_","",$0);print}' test.txt

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla

0

Are these all of the possible formats for your FASTA headers? I ask because the regex from either sed, Perl, or awk won't really matter if there is more variety than what you show in your example FASTA headers. The regex using any of these programs has to be exactly tailored to the input, which if I had to guess is more diverse than what you show here.
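One way to test that assumption before relying on any regex is to list the headers that do not contain a digit followed by an underscore, since those are the ones the `[0-9]_` based commands in this thread would leave untouched. This is only a sketch: the function name `unmatched_headers`, the temporary demo file, and the fourth header (`>NoDigits_Some_org`) are made up for illustration.

```shell
# Hypothetical helper: print header lines that do NOT contain "<digit>_".
# These headers would pass through the "[0-9]_" based commands unchanged.
unmatched_headers() {
    grep '^>' "$1" | grep -v '[0-9]_'
}

# Demo file: the three headers from the question plus one invented header
# that breaks the pattern. Sequences are shortened for brevity.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>KZR5864_Org_name_nam_strain.11
GHTKKLACWQ
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNC
>SMAR_08120_Other_org_name_bla
AGTHHKKLAM
>NoDigits_Some_org
MKKLAAPP
EOF

unmatched_headers "$demo"
```

On the real file this would be run as `unmatched_headers input.fa`; no output means every header contains the digit-underscore pattern.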

0

Yes, these are all the possible formats for fasta headers in the input file.

1

This works with your example. I can't guarantee it will work with all of the lines.

grep ">" input.fa |perl -pe "s/>\w+\.\d+\_(.+)/\1/"|perl -pe "s/>[A-Za-z0-9]+_(.+)/\1/"|perl -pe "s/[0-9]+_(.+)/\1/"

0

Thanks, it worked! It is giving all the organisms names.

1

Or:

grep '>' input.fa | sed 's/.*[0-9]_//'


grep picks the lines containing '>' (the headers), and sed removes everything up to and including the last "[0-9]_" match on each line.
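Because `.*` is greedy, the substitution strips up to the last digit-underscore pair on a line, not the first. That makes no difference for the three headers in the question, but it would for a name that itself contains a digit followed by an underscore. A small sketch of the effect, where the function name `strip_prefix` and the second header are invented for illustration:

```shell
# Hypothetical helper wrapping the grep | sed command from this answer.
strip_prefix() {
    grep '>' "$1" | sed 's/.*[0-9]_//'
}

# Demo file: one header from the question and one made-up header whose
# organism name contains "7_".
demo=$(mktemp)
cat > "$demo" <<'EOF'
>SMAR_08120_Other_org_name_bla
AGTHHKKLAM
>ABC123_Some_org_strain7_x
MKKLAAPP
EOF

strip_prefix "$demo"
# First header:  Other_org_name_bla
# Second header: x   (the greedy .* ran through to the last "7_")
```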

0

It worked too! Thanks!

1
2.8 years ago

So the organism name is everything that follows a digit and an underscore.

$ grep -oP '(?<=[0-9]_).*' input.fa

• -o makes grep print only the matched part rather than the whole line
• -P activates Perl-compatible regular expressions, which are needed for the positive lookbehind
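Not every grep build supports -P (it is a GNU grep option that depends on PCRE support). Where it is unavailable, a roughly equivalent awk version might look like the sketch below: `match()` locates the leftmost digit-underscore pair, which is where the lookbehind first succeeds, and `substr()` prints what follows. Unlike the grep command, it only looks at header lines. The function name is made up for illustration.

```shell
# Hypothetical helper: for each header line, print the text after the
# first "<digit>_" pair. Uses only POSIX awk features (match, RSTART, RLENGTH).
first_digit_underscore_suffix() {
    awk '/^>/ && match($0, /[0-9]_/) { print substr($0, RSTART + RLENGTH) }' "$1"
}

# Demo file with the three headers from the question (sequences shortened).
demo=$(mktemp)
cat > "$demo" <<'EOF'
>KZR5864_Org_name_nam_strain.11
GHTKK
>OIA4716.3_Org_other_name_bla_bla
AHHTT
>SMAR_08120_Other_org_name_bla
AGTHH
EOF

first_digit_underscore_suffix "$demo"
```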