Extract specific information from headers of fasta file
6.4 years ago
Crystal ▴ 50

Hi,

I know this is a basic question, and I posted a similar one before, but I need a small modification to the code to get the right information.

This is the format of the header:

>AAA23421(AI041) fim41, [Escherichia coli]

I need to extract only the "AAA23421(AI041)" part from the header. The length of this part varies between sequences in this fasta file.

I tried to modify and use this code:

grep -Po -e ">.*?\)" fileName.fa | sed 's/^>//g' >file1.txt

but it didn't work.

Can anyone help with this?

Thanks

Crystal

perl -lne 'if(/>(.*?)\((.*?)\) /){print "$1($2)"}' fileName.fa

(.*?) - a non-greedy match: any characters, of any length, stopping as soon as the rest of the pattern can match
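
If you want to try the pattern without the original data, here is a minimal sketch that does the same extraction with sed. The first header is the one from the question; the second header, the sequences, and the temporary file are made up for illustration. It assumes every header you care about contains a closing parenthesis.

```shell
# Build a tiny two-record FASTA file (second record and sequences are made up).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>AAA23421(AI041) fim41, [Escherichia coli]
ATGAAACGT
>BBB1(XY9) someGene, [Escherichia coli]
ATGCCC
EOF

# On header lines only, keep the text after '>' up to and including the first ')'.
# -n with the 'p' flag prints only lines where the substitution happened,
# so sequence lines and headers without ')' are skipped.
ids=$(sed -n -E 's/^>([^)]*\)).*/\1/p' "$tmp")
echo "$ids"

rm -f "$tmp"
```

Using `[^)]*` instead of `.*?` gives the same "stop at the first closing parenthesis" behaviour without needing Perl-style regex support.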


Thank you so much! This code works, too.

Crystal


Actually, the code I modified DOES work on the server!! I ran it in the wrong place before.

Sorry for any confusion.

Crystal

6.4 years ago

I just replied with a link to the extracted results and an explanation of the sed command that I posted in the other thread. The command you just tried works on the dataset you posted to dropbox in the other thread,