First: The definition of your problem

Question

How to edit fasta headers to keep only ID and organism?

1

Entering edit mode

4.8 years ago

A_heath ▴ 170

Hi all,

I have a multifasta file containing a lot of sequences. All of them have a header as the following example:

>2624749465 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]

I would like to keep only:

>2624749465_Selenomonas_ruminatium

As I'm still learning about bash scripts, I tried with the cut command:

cut -d ' ' -f1 your_file.fa > new_file.fa

So now I have only the gene ID but not the organism. How could I change my command line in order to have both ? Or if you have other suggestions, please let me known!

Thank you in advance for your help

Have a great day

fasta • 4.2k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 4.8 years ago by A_heath ▴ 170

3

Entering edit mode

Search Biostars for "edit fasta header" threads using Google. This is a FAQ question and you should find the right thread.

ADD REPLY • link 4.8 years ago by GenoMax 152k

score 4 · Accepted Answer · 2020-09-24

From your question I deduce that your experience with linux and bash is very limited. As @genomax said you can found a lot of threads in google.

This is not a straight answer but the main hints of how you need to approach this problem in order to advance in your bioinformatics learning.

First: The definition of your problem

First you need to process all the lines that start by ">"
Then in this line you want to change the fasta id from the numeric_id to a numeric_id plus the species name
You need to be sure that all the entries have a genus+species after a "[" (if only genus is known it should have the "sp" for species in order to capture two words after the "[")

Second: start with the core problem and test it

So you need to "capture" some parts of the string and put them together with an "_". This is a problem to be solves with Regular Expressions. I recommend you to read any tutorial about it and use this web to practice: https://regex101.com/

You need to know about "capturing" in regexp

If you put this regexp in the webtool mentioned above and your fasta header in the main text area you will see at the right panel the explanation of the regexp.

^>(\d+)[^\[]+\[(\w+\s\w+)

Here you can see a Perl one liner with the basic logic for your needs. You can do this with Perl, awk or Python and will be more or less the same

This is your toy line to practice with echo (printing the text to parse with the oneliner) a pipe (|) for redirecting the echo to the perl oneliner, and the perl one liner itself

$ echo '>2624749465 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]' | perl -lane '($id, $g, $sp)=$_=~/^>(\d+)[^\[]+\[(\w+)\s(\w+)/; print ">" . join("_", $id, $g,$sp) . " ". join(" ", @F[1..$#F]) '

Explaining the one liner:

# I am capturing the three components I want to join ($id, $g, $sp)=$_=~/^>(\d+)[^\[]+\[(\w+)\s(\w+)/

# then join them together with "_" and adding the ">" at the beginning of the line print ">" . join("_", $id, $g,$sp)

With this you have your id changed, but is nice to leave the full description as the fasta comment so I added the . " ". join(" ", @F[1..$#F])

In Perl one liners the option "-a" split the text by spaces and you can access them from the array @F, in this case I am getting all the elemnts of this array except the first one (index 0) so I made a range ".." from the second element (index 1) to the last element of the array (in per defined with $#F) @F[1..$#F]

Third: create a small test file

 $ cat fasta.fa 
>1 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtggtgtgt
atgtgtagtag
ayhyh
>2 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtacgtacgtagctagctag
acgtagctagctagtcgatcgat

Fourth: create the program

You can use the magic of Perl inline update "-i" flag and change the file in situ and create a backup file with the wanted extension (.back here). So your original code will be in file fasta.back and the updated one in the fasta.fa

 $ perl -i.back  -lane 'if (/^>/) { ($id, $g, $sp)=$_=~/^>(\w+)[^\[]+\[(\w+)\s(\w+)/; print join("_", $id, $g,$sp) . " ". join(" ", @F[1..$#F])} else {print} ' fasta.fa
 $ cat fasta.fa
>1_Selenomonas_ruminatium radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtggtgtgt
atgtgtagtag
ayhyh
>2_Selenomonas_ruminatium radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtacgtacgtagctagctag
acgtagctagctagtcgatcgat

Final remark. I will recommend you to learn Python and bash if you are processing data or doing bioinformatics. I am using Perl here as a better grep/awk and allowing me to do a quick transform in a one line (not the best pedagogic approach, but I as I said at the start, this was not a straight answer). You can learn more about this Perl one liners at https://www.manning.com/books/minimal-perl. Nevertheless I will recommend you to practice creating a Python program with regular expressions and reading line by line your fasta in order to practice your text mining skills.