How to extract organism names from the headers in a multi-FASTA file?
5.1 years ago
MB ▴ 50

Hi, I am trying to extract organisms' names from the headers in a multi-fasta file named input.fa shown below:

>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK

What I want to obtain is the organism name in each header. I have tried the following sed command, but I am unable to check for alphanumerics, so I also get the digits after the first underscore, as in the third header.

sed -eT -e 's|_|&\n|;D' input.fa > out.txt

Expected results:

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla

Please tell me how to obtain org names only. Thanks!

Fasta Regex Sed Header • 2.6k views

with sed:

$ sed -rn 's/.*(org\.*)/\1/pgi' test.txt  (or)
$ sed -n '/>/ s/.*[0-9]_//p' test.txt

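Output of the first command, which keys on the literal text "org" and therefore truncates the third header (the second command prints Other_org_name_bla for it):

Org_name_nam_strain.11
Org_other_name_bla_bla
org_name_bla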

with awk:

$ awk '/>/ {sub(".*[0-9]_","",$0);print}' test.txt

Org_name_nam_strain.11
Org_other_name_bla_bla
Other_org_name_bla

Are these all of the possible formats for your FASTA headers? I ask because the regex from either sed, Perl, or awk won't really matter if there is more variety than what you show in your example FASTA headers. The regex using any of these programs has to be exactly tailored to the input, which if I had to guess is more diverse than what you show here.
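
One way to check this before settling on a regex is to list any headers that lack the digit-plus-underscore prefix the patterns in this thread rely on. The following is only a rough sketch: the sequences are shortened, and the third header is made up for illustration and does not come from the original file.

```shell
# Sample records; '>NoDigitPrefix_Some_org' is hypothetical, added only to trigger the check.
fasta='>KZR5864_Org_name_nam_strain.11
GHTKKLACWQ
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNC
>NoDigitPrefix_Some_org
AGTHHKKLAM'

# Keep header lines only, then keep those WITHOUT a "<digit>_" pair.
unmatched=$(printf '%s\n' "$fasta" | grep '^>' | grep -v '[0-9]_')
printf '%s\n' "$unmatched"
```

If nothing is printed, every header contains the expected prefix; any header that is printed would pass through the digit/underscore regexes unchanged.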


Yes, these are all the possible formats for fasta headers in the input file.


This works with your example. I can't guarantee it will work with all of the lines.

grep ">" input.fa |perl -pe "s/>\w+\.\d+\_(.+)/\1/"|perl -pe "s/>[A-Za-z0-9]+_(.+)/\1/"|perl -pe "s/[0-9]+_(.+)/\1/"

Thanks, it worked! It is giving all the organism names.


Or:

grep '>' input.fa | sed 's/.*[0-9]_//'

grep picks the lines containing '>' (the FASTA headers), and sed removes everything up to and including the last "[0-9]_" match (a digit followed by an underscore).
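
Because ".*" is greedy, sed deletes up to the last digit-underscore pair on the line rather than the first. A small sketch of that behaviour follows; the second header is hypothetical and is not one of the formats in the question.

```shell
# Header from the question: it has a single "<digit>_" pair, so the result is as expected.
from_question=$(printf '%s\n' '>SMAR_08120_Other_org_name_bla' | sed 's/.*[0-9]_//')
printf '%s\n' "$from_question"

# Hypothetical header with a second "<digit>_" pair inside the organism name.
hypothetical=$(printf '%s\n' '>ABC123_Org_sp2_strain' | sed 's/.*[0-9]_//')
printf '%s\n' "$hypothetical"
```

The first call prints Other_org_name_bla, while the second prints only strain, so this approach assumes that organism names never contain a digit directly followed by an underscore.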


It worked too! Thanks!

5.1 years ago

So the organism name is everything that follows a digit and an underscore.

$ grep -oP '(?<=[0-9]_).*' input.fa
  • -o makes grep print only the matched part rather than the whole line
  • -P activates Perl-compatible regular expressions, which are needed for the positive lookbehind
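
A sketch of the command on the sample data from the question, assuming GNU grep built with PCRE support (-P is not available in every grep). The last header is hypothetical and only illustrates where the lookbehind first succeeds.

```shell
# Sample input copied from the question, held in a variable to keep the sketch self-contained.
fasta='>KZR5864_Org_name_nam_strain.11
GHTKKLACWQRTTAAFFGYYWOPPEEDSSSSLKKDDIIPFTQWENMAATGGFDMLLAAPP
>OIA4716.3_Org_other_name_bla_bla
AHHTTIPLNCCWWETRQKLLSSNNNMTIPAHGFSSLLKANCDSM
>SMAR_08120_Other_org_name_bla
AGTHHKKLAMNCWTQEREYPPILLSSDFMNCCVTTQQLAK'

# Sequence lines contain no "<digit>_" pair, so they produce no match and are skipped.
names=$(printf '%s\n' "$fasta" | grep -oP '(?<=[0-9]_).*')
printf '%s\n' "$names"

# Hypothetical header: the match starts after the FIRST "<digit>_" pair.
first_pair=$(printf '%s\n' '>ABC123_Org_sp2_strain' | grep -oP '(?<=[0-9]_).*')
printf '%s\n' "$first_pair"
```

No separate grep '>' step is needed here, because only lines containing a digit followed by an underscore can match. For the hypothetical header the result is Org_sp2_strain, since the regex engine takes the leftmost position where the lookbehind holds.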