Help for changing fasta seq ID
8.5 years ago
vtefnfqp ▴ 10

Hi, everyone! I want to change FASTA sequence IDs as follows:

I have 300 seqs like this:

>CDM35446 CDM35446.1 Acyl-CoA N-acyltransferase [Penicillium roqueforti FM164].
MASSSIFPFHVGEASNER.................

I want to change the seq ID to:

>CDM35446|Penicillium_roqueforti
MASSSIFPFHVGEASNER.................

like this: ID|species_name.

I know a simple Perl script would do this, but writing scripts is really not easy for me. I would really appreciate it if anyone could help. Also, you can send the script to me: <REMOVED>.

perl script fasta • 2.7k views

Please do not open with a request to take the discussion off the forum. I have removed that part of your post.

8.4 years ago
george.ry ★ 1.2k

If the ID lines are always in the same format and have two-word 'genus species' Latin names, as in the example:

sed 's/^>\([[:alnum:]]*\).*\[\([[:alpha:]]*\) \([[:alpha:]]*\).*/>\1|\2_\3/' yourfile.fa > yournewfile.fa

Great, although I think you should have some word boundaries in there...
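One way to tighten the match, sketched here under assumptions: `\b` word boundaries are a GNU sed extension, so this version spells out the delimiters instead. The ID must be followed by whitespace, and the species word must be followed by a space or the closing bracket. The function name `rename_headers` is made up for illustration, and `sed -E` (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed. Headers that do not fit the layout are left unchanged rather than being partially rewritten.

```shell
# Hypothetical helper: rewrite ">ID description [Genus species ...]" headers
# read from stdin as ">ID|Genus_species"; all other lines pass through untouched.
rename_headers() {
    sed -E 's/^>([[:alnum:]]+)[[:space:]].*\[([[:alpha:]]+) ([[:alpha:]]+)[] ].*$/>\1|\2_\3/'
}
```

Usage would look like `rename_headers < yourfile.fa > yournewfile.fa`. A header such as `>ABC1 protein [Aspergillus sp. 123]` stays as it was, because `sp.` is not followed by a space or `]`, which makes odd cases easy to spot afterwards.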

8.4 years ago

You need to find a common pattern in the headers to do what you are asking.

For example, if you want the first word (CDM35446), which is followed by a tab or space, and then want to append the two words (genus and species) contained in the first set of brackets [], you can do it.

But if the genus and species are not always contained in brackets, or the information in your FASTA headers is not in a consistent column-like layout, this is a hard task to accomplish.
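Before renaming anything, it may help to check whether every header really follows the assumed pattern. The following is a minimal sketch, not a definitive check: the function name `nonconforming_headers` is hypothetical, and the layout it tests for (an alphanumeric ID, whitespace, then a bracketed block starting with two alphabetic words) is an assumption based on the single example header in the question.

```shell
# Hypothetical helper: read FASTA on stdin and print only the header lines
# that do NOT look like ">ID description [Genus species ...]".
nonconforming_headers() {
    grep '^>' | grep -v -E '^>[[:alnum:]]+[[:space:]].*\[[[:alpha:]]+ [[:alpha:]]+[^]]*]'
}
```

Running `nonconforming_headers < yourfile.fa` and getting no output would suggest that all headers fit the pattern; any lines printed are the ones that need separate handling.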

8.4 years ago

Dear vtefnfqp,

The one-liner above is great and should work. Here is my commented Perl version, which I tested on a file. Save the script below as a .pl file and run it from the directory containing your FASTA file. Change the file names in the script (for example, the extension from txt to fa) to match your files:

#!/usr/bin/perl
use strict;
use warnings;

# Open the input file, which is in the same location as the script.
open(my $fastafile, '<', "./sampleFile.txt") or die "Cannot open ./sampleFile.txt: $!";

# Initialize an empty array that will hold each line of the file.
my @fasta_array;

# Read the file line by line, strip the line ending, and push each line onto the array.
while (my $line = <$fastafile>) {
    chomp($line);
    push(@fasta_array, $line);
}
close($fastafile);

# If a line starts with '>', rebuild it as >ID|Genus_species using the first
# two words inside the last [...] block of the header.
for (my $i = 0; $i < scalar(@fasta_array); $i++) {
    if ($fasta_array[$i] =~ /^>([A-Z0-9]+)\s.*\[([A-Z]+)\s+([A-Z]+)[^\]]*\]/i) {
        $fasta_array[$i] = ">" . $1 . "|" . $2 . "_" . $3;
    }
}

# Join the array elements with newlines and store them in a variable.
my $result = join("\n", @fasta_array) . "\n";

# Open a file handle for the result.
open(my $resultfile, '>', "./resultFile.txt") or die "Cannot open ./resultFile.txt: $!";

# Write the result to the file.
print $resultfile $result;
close($resultfile);

I hope this is helpful,

Good luck with your research,
