Question: Help for changing fasta seq ID
3.5 years ago by
vtefnfqp10 wrote:

Hi, everyone! I want to change the FASTA sequence IDs as follows:

I have 300 seqs like this:

>CDM35446 CDM35446.1 Acyl-CoA N-acyltransferase [Penicillium roqueforti FM164].


I want to change the seq ID to a format like this: ID|species_name.


I know a simple Perl script would fix this, but writing scripts is really not easy for me. I would really appreciate it if anyone could help. Also, you can send the script to me: <REMOVED>.


Please do not open with a request to take the discussion off the forum. I have removed that part of your post.

written 3.5 years ago by RamRS
3.5 years ago by
United Kingdom
george.ry1.1k wrote:

If the ID lines are always in the same format and contain two-word 'genus species' Latin names, as in the example:

cat yourfile.fa | sed 's/^>\([[:alnum:]]*\).*\[\([[:alpha:]]* [[:alpha:]]*\).*/>\1|\2/' > yournewfile.fa
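
As a quick check, the same substitution can be applied to the example header from the question. This is only a sketch: the wrapper name `rename_headers` is hypothetical, the `ATGC` line is a placeholder sequence rather than real data, and it assumes a sed that supports POSIX character classes (GNU and BSD sed both do).

```shell
# Hypothetical wrapper around the substitution shown above: rewrite each
# header to ">ID|Genus species" and pass sequence lines through unchanged.
rename_headers() {
    sed 's/^>\([[:alnum:]]*\).*\[\([[:alpha:]]* [[:alpha:]]*\).*/>\1|\2/' "$@"
}

# Apply it to the example header from the question (ATGC is a placeholder).
printf '%s\n' \
    '>CDM35446 CDM35446.1 Acyl-CoA N-acyltransferase [Penicillium roqueforti FM164].' \
    'ATGC' | rename_headers
# prints:
# >CDM35446|Penicillium roqueforti
# ATGC
```

Note that the species name keeps its space; the strain (FM164) is dropped because only the first two words inside the brackets are captured.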



Great, although I think you should have some word boundaries in there...

written 3.5 years ago by Matt Shirley
3.5 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

You need to find a common pattern in the headers to do what you are asking.

For example, if you want the first word (CDM35446), which is followed by a tab or a space, and then want to add as the next two words (genus and species) what is contained between the first set of brackets [], you can do it.

But if the genus and species are not always contained between brackets, or the information in your FASTA headers is not laid out in consistent columns, this is a hard task to accomplish.
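
One way to find out whether all headers share the pattern is to list those that do not. The following is a minimal sketch, assuming the layout from the question (an alphanumeric ID, then whitespace, then a bracketed name starting with two alphabetic words); the function name `nonconforming_headers` is hypothetical.

```shell
# Hypothetical helper: print FASTA header lines that do NOT fit the assumed
# layout ">ID <description> [Genus species ...]". Sequence lines are ignored.
nonconforming_headers() {
    sed -n '/^>/{
        /^>[[:alnum:]]\{1,\}[[:space:]].*\[[[:alpha:]]\{1,\} [[:alpha:]]\{1,\}.*]/!p
    }' "$1"
}

# Usage: nonconforming_headers yourfile.fa
```

If this prints nothing, every header fits the assumed layout and a single substitution should handle the whole file; any lines it does print need to be inspected by hand.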

3.5 years ago by
Ibrahim Tanyalcin930 wrote:

Dear vtefnfqp,

The one-liner above is great and probably works. Here is my commented version; I tested the script with a file and it works. Save the script below in a .pl file and run it in the same directory as your FASTA file. Change the file extension in the script from txt to fa, or vice versa, to match your file:

use strict;
use warnings;

#open your file, which is in the same location as the script.
open(my $fastafile, '<', "./sampleFile.txt") or die "Cannot open ./sampleFile.txt: $!";

#initiate an empty array which will contain each line of your file.
my @fasta_array;

#read your file line by line. Strip the newline, then push each line into the array above.
while(<$fastafile>) {
    chomp;
    push(@fasta_array, $_);
}
close($fastafile);

#if a line starts with a greater-than sign, do the conversion based on the following regex.
#$1 is the ID; $2 is the first two words inside the brackets (assumed to be genus and species).
for (my $i = 0; $i < scalar(@fasta_array); $i++) {
    if($fasta_array[$i] =~ /^>([A-Z0-9]+)\s.*\[(\S+\s+\S+).*\]/i) {
        #keep the leading ">" so the output is still valid FASTA.
        $fasta_array[$i] = ">".$1."|".$2;
    }
}

#combine the array elements with new lines. Store them in a variable.
my $result = join("\n", @fasta_array);

#initiate a file handle which will contain your result.
open(my $resultfile, '>', "./resultFile.txt") or die "Cannot open ./resultFile.txt: $!";

#write your result to file, ending with a final newline.
print $resultfile $result, "\n";
close($resultfile);


I hope this is helpful,

Good luck with your research,
