How to edit fasta headers to keep only ID and organism?
1
0
Entering edit mode
16 months ago
A_heath ▴ 60

Hi all,

I have a multifasta file containing a lot of sequences. All of them have a header as the following example:

>2624749465 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]


I would like to keep only:

>2624749465_Selenomonas_ruminatium


As I'm still learning about bash scripts, I tried with the cut command:

cut -d ' ' -f1 your_file.fa > new_file.fa


So now I have only the gene ID but not the organism. How could I change my command line in order to have both ? Or if you have other suggestions, please let me known!

Have a great day

fasta headers bash edit • 900 views
1
Entering edit mode

1
Entering edit mode
16 months ago

From your question I deduce that your experience with linux and bash is very limited. As @genomax said you can found a lot of threads in google.

This is not a straight answer but the main hints of how you need to approach this problem in order to advance in your bioinformatics learning.

# First: The definition of your problem

• First you need to process all the lines that start by ">"
• Then in this line you want to change the fasta id from the numeric_id to a numeric_id plus the species name
• You need to be sure that all the entries have a genus+species after a "[" (if only genus is known it should have the "sp" for species in order to capture two words after the "[")

So you need to "capture" some parts of the string and put them together with an "_". This is a problem to be solves with Regular Expressions. I recommend you to read any tutorial about it and use this web to practice: https://regex101.com/

You need to know about "capturing" in regexp

If you put this regexp in the webtool mentioned above and your fasta header in the main text area you will see at the right panel the explanation of the regexp.

^>(\d+)[^\[]+\[(\w+\s\w+)


Here you can see a Perl one liner with the basic logic for your needs. You can do this with Perl, awk or Python and will be more or less the same

This is your toy line to practice with echo (printing the text to parse with the oneliner) a pipe (|) for redirecting the echo to the perl oneliner, and the perl one liner itself

$echo '>2624749465 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]' | perl -lane '($id, $g,$sp)=$_=~/^>(\d+)[^\[]+\[(\w+)\s(\w+)/; print ">" . join("_",$id, $g,$sp) . " ". join(" ", @F[1..$#F]) '  Explaining the one liner: # I am capturing the three components I want to join ($id, $g,$sp)=$_=~/^>(\d+)[^\[]+\[(\w+)\s(\w+)/ # then join them together with "_" and adding the ">" at the beginning of the line print ">" . join("_",$id, $g,$sp)

With this you have your id changed, but is nice to leave the full description as the fasta comment so I added the . " ". join(" ", @F[1..$#F]) In Perl one liners the option "-a" split the text by spaces and you can access them from the array @F, in this case I am getting all the elemnts of this array except the first one (index 0) so I made a range ".." from the second element (index 1) to the last element of the array (in per defined with$#F) @F[1..$#F] # Third: create a small test file $ cat fasta.fa
>1 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtggtgtgt
atgtgtagtag
ayhyh
>2 radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtacgtacgtagctagctag
acgtagctagctagtcgatcgat


# Fourth: create the program

You can use the magic of Perl inline update "-i" flag and change the file in situ and create a backup file with the wanted extension (.back here). So your original code will be in file fasta.back and the updated one in the fasta.fa

 $perl -i.back -lane 'if (/^>/) { ($id, $g,$sp)=$_=~/^>(\w+)[^\[]+\[(\w+)\s(\w+)/; print join("_",$id, $g,$sp) . " ". join(" ", @F[1..$#F])} else {print} ' fasta.fa$ cat fasta.fa
>1_Selenomonas_ruminatium radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtggtgtgt
atgtgtagtag
ayhyh
>2_Selenomonas_ruminatium radical S-adenosyl methionine domain-containing protein 2 [Selenomonas ruminatium S137 : Ga0066891_103]
actgtacgtacgtagctagctag
acgtagctagctagtcgatcgat


Final remark. I will recommend you to learn Python and bash if you are processing data or doing bioinformatics. I am using Perl here as a better grep/awk and allowing me to do a quick transform in a one line (not the best pedagogic approach, but I as I said at the start, this was not a straight answer). You can learn more about this Perl one liners at https://www.manning.com/books/minimal-perl. Nevertheless I will recommend you to practice creating a Python program with regular expressions and reading line by line your fasta in order to practice your text mining skills.